
Authors

Junyang Wu, Rong Tao, Guoyan Zheng

Abstract

In this paper, we address the problem of estimating remaining surgical duration (RSD) from surgical video frames. We propose a Bayesian long short-term memory (LSTM) network-based deep negative correlation learning approach, called BD-Net, for accurate RSD regression as well as prediction uncertainty estimation. Our method aims to extract discriminative visual features from surgical video frames and to model the temporal dependencies among frames to improve RSD prediction accuracy. To this end, we propose to ensemble a group of Bayesian LSTMs on top of a backbone network by way of deep negative correlation learning (DNCL). More specifically, we deeply learn a pool of decorrelated Bayesian regressors with sound generalization capabilities through managing their intrinsic diversities. BD-Net is simple and efficient. After training, it can produce both an RSD prediction and an uncertainty estimate in a single inference run. We demonstrate the efficacy of BD-Net on a public video dataset containing 101 cataract surgeries. The experimental results show that the proposed BD-Net achieves better results than state-of-the-art (SOTA) methods.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_40

SharedIt: https://rdcu.be/cVRXf

Link to the code repository

https://github.com/jywu511/BD-Net

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

This paper introduces a multi-task hybrid model, named BD-Net, for remaining surgical duration (RSD) estimation. In particular, multiple sub-models of a Bayesian LSTM with enforced feature diversity are incorporated to estimate both RSD and uncertainty. In the end, the network achieves state-of-the-art results for RSD estimation on the cataract-101 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work proposed a novel formulation for RSD. Based on a CNN+LSTM architecture, the Bayesian LSTM (B-LSTM) is introduced and implemented with the deep negative correlation learning method for this regression task.
    • The network is trained in a multi-purpose manner, performing phase classification, surgeon’s experience estimation, RSD estimation, and uncertainty estimation simultaneously. The ensemble of B-LSTM models, each given a different dropout probability to enforce feature diversity, allows uncertainty to be estimated from a single sample during inference, rather than through the multiple sampling used to estimate uncertainty in previous work.
    • A comparison of MAE results with three other state-of-the-art methods is provided, with a significance test showing reliable results for RSD estimation at 5 min and 2 min before the end of surgery.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The network is validated on a single dataset and surgery type. Cataract surgery is generally short (5–20 min), so the performance on longer surgeries such as cholecystectomy may need further justification in comparison with methods designed for those surgeries.
    • The surgeon’s experience estimation is described as a function of the network, but limited information is provided on its definition and setup.
    • Adding a future-work paragraph to the conclusion section, outlining development directions for this model, may be helpful.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work was implemented on the public dataset cataract-101 and a reference implementation model is provided. Network settings and hyperparameters are detailed in the text to ensure good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It would be worthwhile to validate the method on other datasets, such as Cholec80. Surgeon experience labels are not necessarily present in other datasets, so testing the method without predicting surgeon experience could also be considered. At present, the method only estimates the remaining time to the end of the surgery. Since different tools are needed for different phases, preliminary tool preparation is required during surgery, and the network does have the ability to classify phases, predicting the time remaining in each phase could be a potential extension of BD-Net.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-organized and presents a new approach for RSD prediction. The results are verified with good evidence on a public surgical dataset, surpassing the state-of-the-art methods. The idea of averaging the predictions of multiple diversified sub-models, similar to a multi-headed mechanism, shows good potential in this multi-task design.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel model for predicting remaining surgery duration in cataract surgeries. By using deep negative correlation learning, the model can estimate uncertainty in a single inference step. The proposed method outperforms the state of the art, and the authors show promising plots indicating effective uncertainty estimation. Uncertainty estimation is especially important in this task due to the inherent ambiguity of future events.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method is simple and achieves strong results, even without phase/expertise labels (as opposed to CataNet)
    • The authors use a statistical test to demonstrate the significance of their results
    • Uncertainty estimation is likely very useful for future-prediction tasks like RSD prediction. To my best knowledge, previous RSD methods have not considered uncertainty.
    • The uncertainty-related plots (Fig. 2(A) + suppl. Fig. 2) look very promising.
    • The proposed architecture builds on the one in CataNet, as it uses the same backbone and a same-size LSTM, and is therefore directly comparable. (However, this could maybe be made clearer in the paper.)
    • Also, the training procedure is similar to but simpler than the one from CataNet (essentially the first 2 stages of CataNet).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. All the uncertainty-related results (Fig. 2(A) + suppl. Fig. 2) are restricted to selected surgeries - making it difficult to understand the average-case performance.
    2. The qualitative results (Fig. 2(A)) only seem to show easy examples with similar duration.
    3. It is not clear how the variance-maximization objective of DNCL is compatible with uncertainty estimation.
    4. It would be very helpful to include an ablation without both DNCL and Bayesian LSTMs (i.e. only phases/expertise).
    5. It is not clear from the paper if 6-fold cross validation was used like in CataNet. If not, then the SOTA scores in Table 1 are not completely comparable.

    The weaknesses and possible solutions are elaborated in more detail in section 8.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method description is mostly very clear and detailed. The authors also promise to publish their code.

    Some details could be clarified:

    • What kind of statistical test was used to obtain the p-values?
    • Was 6-fold cross-validation performed like in CataNet or were all 81 videos used for training? In the former case, what validation metric was used to select the best model after training (e.g. MAE-ALL or loss)? In the latter case, how often were experiments repeated?
    • How were the MAE scores computed from multiple models (either through cross-val or repeated experiments)? Are predictions from multiple models averaged to compute one score or are multiple scores computed and then averaged?
    • It appears that the authors report the mean and standard deviation over surgeries but this is not specified in the paper.
    • I am not sure what exactly the ablation ‘w/o DNCL’ means. Is simply the loss term from eq. 3 removed or is it a non-ensemble variant with only one LSTM prediction head?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    MAIN WEAKNESSES

    1. Uncertainty evaluation
      • All the uncertainty-related results (Fig. 2(A) + suppl. Fig. 2) are restricted to selected surgeries. This makes it difficult to understand the performance of the average case and not just the best.
      • SUGGESTION: E.g., all Pearson r values could be presented in a single plot in the supplementary material.
    2. Qualitative evaluation
      • The example surgeries in Fig 2(A) are all ca 5 minutes long and are thus probably rather easy examples.
      • SUGGESTION: It would be more insightful to also include more difficult examples (e.g. longer surgeries) and failure cases.
    3. Variance of predictions is maximized
      • The DNCL loss explicitly enforces the model to have high-variance predictions. How is this compatible with uncertainty estimation? The model should be encouraged to have low variance when it is “certain” about the remaining time. The qualitative results indicate that the model does have low-variance predictions in certain cases and that uncertainty decreases over time (as expected).
      • One explanation could be: Since negative RSD predictions are implausible and thus discouraged during training, the predictions automatically have lower variance when the predicted RSD is lower. However, this could mean that uncertainty predictions look better than they are. The DNCL objective might make it difficult for the model to give meaningful, input-specific uncertainty estimates and instead simply give higher variance with higher RSD values. Fig. 2(A) actually suggests that this might be happening.
      • SUGGESTION: How do the authors justify the use of DNCL for uncertainty estimation? Has this been done in previous work and if so, how is it justified there? This should be discussed more in the paper.
    4. Missing ablation
      • SUGGESTION: It would be very helpful to include an ablation without both DNCL and Bayesian LSTMs (i.e. only phases/expertise) since it seems like these two components are most effective if used together. So using only one could potentially degrade performance. E.g. without dropout in the LSTMs, the variance-maximization might be difficult to achieve.
      • Adding this ablation would make the combined contribution of “DNCL + Bayesian” clearer.
      • Additionally, this ablation would be very similar to CataNet but would be more comparable to the proposed model since the same training and evaluation scheme was used.
    5. Results possibly not completely comparable to SOTA
      • It is not clear from the paper if 6-fold cross validation was used like in CataNet. Since the SOTA results were directly copied from the CataNet paper and not reimplemented, the scores might not be entirely comparable if all 81 videos were used for training. However, adding the missing ablation from ‘Weakness 4’ would alleviate this problem.
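
    To make the variance-maximization concern in point 3 concrete, here is a minimal sketch of a negative-correlation-learning loss for a regression ensemble, in the spirit of the DNCL objective the paper builds on. This is not the authors' implementation; the function name, the `lam` weight, and the exact form of the terms are illustrative assumptions:

    ```python
    import numpy as np

    def ncl_loss(preds, target, lam=0.5):
        """Illustrative negative correlation learning loss for one sample.

        preds:  length-M array of predictions, one per ensemble submodel.
        target: scalar ground-truth value (e.g. the true RSD).
        lam:    diversity weight (hypothetical; the paper's lambda in Eq. (4)
                may be defined differently).

        The first term fits each base model to the target; the second term
        *subtracts* the spread of predictions around the ensemble mean, so
        minimizing the loss rewards diverse (high-variance) predictions --
        the behaviour questioned in point 3 above.
        """
        preds = np.asarray(preds, dtype=float)
        ens_mean = preds.mean()
        accuracy = np.mean((preds - target) ** 2)     # per-model fit
        diversity = np.mean((preds - ens_mean) ** 2)  # ensemble variance
        return accuracy - lam * diversity
    ```

    With identical predictions the diversity term vanishes and the loss reduces to plain MSE; spreading the predictions around the ensemble mean lowers the loss, which is exactly the tension with low-variance "certain" predictions discussed above.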

    REQUIRED CLARIFICATIONS

    • The open questions from the reproducibility section could be clarified.

    MINOR COMMENTS

    • There are some typos: “annotaitons” (section 1), “incorporates” (section 2)
    • In the abstract, the phrase “we deeply learn” is a bit unusual. Maybe “we learn a pool of deep, decorrelated …”?
    • What exactly is meant by the “bias-variance-covariance tradeoff”. Maybe the authors can elaborate this in more detail in the paper.
    • The LSTM visualization in Fig. 1 appears to be based on Colah’s blog (https://colah.github.io/posts/2015-08-Understanding-LSTMs/). Maybe this should be referenced.
    • The term “end-to-end” in the caption of Fig. 1 might be misleading since the model is not trained end to end.
    • Adding statistics regarding the durations of videos would be especially helpful for RSD tasks. What is the mean, median, standard deviation? Or maybe a complete plot of durations could be added to the supplementary material if space permits this.
    • The hyphen in the evaluation metrics “MAE-ALL”, “MAE-5” and “MAE-2” appears to be formatted as a minus in the paper. Maybe using “\text{}” in math mode or not using math mode at all when not necessary would make this look nicer.
    • By naming the ablations by their missing components, I sometimes found it difficult to interpret the ablation results. Maybe representing the ablations with checkmarks would make it easier to follow which components are used and which are missing (e.g. in the style of Table 4 of https://arxiv.org/pdf/1904.07601.pdf).
    • In section 3 (experimental setup), the authors state that videos range from 5 to 20 minutes. However, Fig. 2 (A) shows two examples with durations of less than 5 minutes. Maybe the authors could provide more precise min and max durations.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Overall this is an interesting paper which achieves strong results with a simple model. Additionally, it addresses an important topic in RSD prediction (uncertainty estimation).
    • Some questions regarding the evaluation should be clarified (cross validation?) and the proposed ablation study (w/o both DNCL and Bayesian) would be helpful in my opinion.
    • I find it hard to understand why the variance-maximization objective of DNCL is useful for uncertainty estimation but the results look promising. I believe this should be discussed more in the final version.
  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #6

  • Please describe the contribution of the paper

    This paper proposes a novel BD-Net for remaining surgical duration (RSD) prediction, which ensembles multiple Bayesian LSTMs via deep negative correlation learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. BD-Net achieves better results than the SOTA methods when applied to predicting RSD for cataract surgery.

    2. This work validates the effectiveness of Bayesian-LSTM and DNCL for improving generalization ability on RSD prediction task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors state that they follow the Bayesian-CNN, DNCL, and uncertainty estimation methods in [13-19]. When describing the methods adapted for the RSD prediction task in this article, the authors should give more and clearer explanations, instead of forcing readers to consult references [13-19] themselves.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments are conducted on public dataset. The code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    To improve readability, the authors are suggested to detail the following contents.

    1. Describe the training and inference procedure by inserting an Algorithm, so that the dropout, ensemble, uncertainty estimation, etc. can be clearly introduced.

    2. Explain more about Eq. (5).

    3. Introduce how the default hyperparameters are selected and which hyperparameters are sensitive and should be finely tuned. Give some comparisons of different hyperparameter settings if possible.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work provides a novel BD-Net for remaining surgical duration prediction, whose performance surpasses the SOTA methods. However, as the proposed methods are derived from existing approaches, the method details and training procedure should be presented more clearly.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers agree that this is an interesting work which presents a novel model which outperforms the state-of-the-art. The paper would be improved by incorporating the clarifications suggested by the reviewers and in particular the points raised regarding the methodology, evaluation, and the ablation study.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank meta-reviewer (MR) and all reviewers (Reviewer #2, #3, and #6) for their constructive comments.

Reviewer #2 Validation only on a single dataset with generally short durations After submission, we also validated our method on the Cholec80 dataset, which contains videos of longer duration, and achieved the best results among the SOTA methods. We will report these results in our journal paper.

Reviewer #2 Surgeon’s experience is not clearly defined As presented in our paper, we used the same dataset, i.e., cataract-101, and the same data split as in [9] in our experiments. Details about surgeon’s experience in this dataset were presented in [9], which is why we did not repeat them in our paper. Specifically, cataract-101 divides surgeon’s experience into two categories: junior and senior. The estimation of surgeon’s experience is modeled as a binary classification problem.

Reviewer #2 Future work paragraph showing a developing direction We will add it.

Reviewer #3 Average-case performance and difficult cases not shown Due to page limitation, we did not show all qualitative results in Fig. 2. We will include more results in supplementary materials to show the average-case performance as well as performance on difficult cases.

Reviewer #3 Compatibility between DNCL loss and uncertainty estimation Please note that our model’s uncertainty contains two terms: aleatoric uncertainty and epistemic uncertainty (see Eq. (5) for details). The reviewer’s concern is about the epistemic uncertainty (which has a mathematical form similar to the DNCL objective). Our experimental results show that the aleatoric uncertainty is 2–3 times larger than the epistemic uncertainty. Additionally, according to Eq. (2), when the estimation error is small (i.e., when our model is more certain), the aleatoric uncertainty cannot be too large due to the penalty in Eq. (2), which is consistent with our qualitative results shown in Fig. 2.

Low variance does not mean that our model is “certain” about the remaining time, as the mean estimate may deviate dramatically from the ground truth (i.e., it is not accurate, so the aleatoric uncertainty tends to be large). Low variance also indicates poor ensemble learning. Instead, we should aim for a regression ensemble in which each base model is both “accurate and diversified”.

Reviewer #3 What does ‘w/o DNCL’ mean? It is a non-ensemble variant with only one Bayesian LSTM prediction head, i.e., a non-ensemble variant with Bayesian learning.

Reviewer #6 Previous work not clearly explained Due to the page limit, we cannot present too many details about previous work, but we do think that we have clearly presented the relevant work in the Introduction section.

Reviewer #6 Describe the training and inference procedure by inserting an Algorithm We will insert a training and inference algorithm into our supplementary materials.

Reviewer #6 More explanation about Eq. (5) The method for computing uncertainty was originally introduced by Kendall and Gal [19] and requires drawing parameter samples multiple times. In contrast, as we use Bayesian LSTMs with different dropout probabilities as our submodels, each submodel can be regarded as one sample. Hence we can use Eq. (5) to estimate uncertainty.
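For illustration, the single-pass decomposition described above can be sketched as follows. This is a hedged sketch of the standard Kendall-and-Gal-style split of predictive variance, not the authors' exact Eq. (5); the function name and the assumption that each submodel outputs a mean and an aleatoric variance are my own:

```python
import numpy as np

def predictive_uncertainty(means, variances):
    """Single-pass uncertainty from M ensemble submodels (illustrative).

    means:     length-M array of predicted RSD means, one per submodel.
    variances: length-M array of predicted aleatoric variances.

    Treating each submodel as one Monte-Carlo dropout sample, the total
    predictive variance splits into an aleatoric term (average of the
    predicted variances) and an epistemic term (spread of the predicted
    means across submodels).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    aleatoric = variances.mean()   # data noise, averaged over submodels
    epistemic = means.var()        # model disagreement (population variance)
    return aleatoric + epistemic, aleatoric, epistemic
```

Under this sketch, one forward pass through all M submodels suffices: the ensemble disagreement plays the role of the repeated sampling in the original MC-dropout formulation.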

Reviewer #6 How default hyperparameters are selected and which are sensitive and should be finely tuned On the training set, we conducted 6-fold cross validation for model selection and hyperparameter tuning. We found that the hyperparameter lambda in Eq. (4) is sensitive and should be finely tuned.


