
Authors

Yitong Zhang, Sophia Bano, Ann-Sophie Page, Jan Deprest, Danail Stoyanov, Francisco Vasconcelos

Abstract

In minimally invasive surgery, surgical workflow segmentation from video analysis is a well-studied topic. The conventional approach defines it as a multi-class classification problem, where each video frame is assigned a surgical phase label.

We introduce a novel reinforcement learning formulation for offline phase transition retrieval. Instead of attempting to classify every video frame, we identify the timestamp of each phase transition. By construction, our model does not produce spurious and noisy phase transitions, but contiguous phase blocks. We investigate two different configurations of this model. The first does not require processing all frames in a video (only <60% and <20% of frames in 2 different applications), while producing results slightly under the state-of-the-art accuracy. The second configuration processes all video frames and outperforms the state-of-the-art at a comparable computational cost.

We compare our method against the recent top-performing frame-based approaches TeCNO and Trans-SVNet on the public dataset Cholec80 and also on an in-house dataset of laparoscopic sacrocolpopexy. We perform both a frame-based (accuracy, precision, recall and F1-score) and an event-based (event ratio) evaluation of our algorithms.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_47

SharedIt: https://rdcu.be/cVRXm

Link to the code repository

https://github.com/yitongzh/TRN_code

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a new RL formulation for offline phase transition retrieval. Specifically, a network TRN is proposed which searches phase transitions using multi-agent RL. The proposed method is validated on Cholec80 and an in-house dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of using a multi-agent RL to find phase transition is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The experimental validation is not convincing. In Table 2, the last two rows report the performance of the proposed method only under partial coverage, which is not convincing. It is suggested to also show the performance of the proposed method under full coverage, and to compare with more previous work.

    (2) The bibliographic information for Reference [1] is not given. It is suggested to provide it.

    (3) It is well known that RL methods require more computational resources for training. It would therefore be good to compare the computational efficiency of the proposed method with previous work, in addition to the performance comparison.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper looks fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see Weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The insufficient experimental validation is my major concern.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The manuscript proposes an offline phase transition retrieval method using reinforcement learning. The method predicts phase transition timestamps instead of classifying all frames. The method is evaluated in two settings: the first processes only a sparse subset of frames, yielding results slightly below the SOTA; the second processes all frames and is claimed to outperform the SOTA methods TeCNO and Trans-SVNet. Evaluation is performed at the frame level (accuracy, precision, recall, and F1-score) and at the event level (event ratio).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The abstract is clear and well written.
    • There is a clear contribution in this paper.
    • The results were compared with the relevant SOTA models (TeCNO, Trans-SVNet) in the domain.
    • The paper presents significant insights for research continuation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty of the method is limited, and lacking in details.
    • Paper title does not match the content and work presented.
    • The results are below the baseline on one dataset, which makes it hard to justify the efficacy and generalization capability of the proposed method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides enough information for this.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Title matching content: No. The manuscript describes a method to correctly predict phase transitions, which is itself a “refinement” task over an initial set of predictions rather than a “retrieval” task.

    2. Abstract summarizing content: The abstract mentions the use of the reinforcement learning paradigm but does not fully justify the rationale behind this choice. The abstract describes two configurations using a full and a sparse number of video frames; however, the paper does not elaborate on the usefulness of the two settings beyond computational efficiency. The abstract mentions the “comparable computation cost” of the new method, but this is not specified in any form in the paper. It would be good to provide numbers for the computational cost as well.

    3. Motivation: The motivation for the work is short but clear and matches the goal provided in the abstract. However, it would be great to provide more details about the implications of “erroneous phase transitions”; the reader is left looking for more details on this problem.

    4. Novelty in contribution: The novelty lies in adapting Deep Q-Learning Network (DQN) and Gaussian components, but more details on the rationale behind DQN use would have been better.

    5. Knowledge advancement: The work provides a method for reducing erroneous phase segmentation through the use of a multi-agent DQN and Gaussian smoothing. The work outperforms the re-implemented SOTA baselines on the Cholec80 dataset but not on the in-house dataset. The weak performance (~10% below baselines) on the in-house dataset is justified by the processing of fewer frames (~20%), which is encouraging, but it is evaluated on only one phase transition. More details on the dataset and results for more phase transitions would have been better.

    6. Positioning with existing literature: The manuscript mentions recent works in the field of phase recognition but misses out on mentioning other papers such as MTRCNet-CL, Surgical phase recognition by learning phase transitions (Sahu et al.), etc. Related references are covered for DQN, phase recognition, datasets, but missing for LSTM, ResNet.

    7. Method description and rationales: The method is aptly divided into three modules, each described with sufficient detail and a clear purpose. The clip size in the Average ResNet feature extractor is set to 16. It would be nice to see results for K=8 or K=32 to justify the choice of K=16; for example, the K=8 setting might make the predictions less noisy or more refined.
      The DQN Transition Retrieval subsection does not provide the reasoning for choosing an RL-based network over standard CNN/RNN-based methods, nor does it share results to ascertain why RL is necessary. The second FC layer after the DQN is 50-dimensional and is mapped to “2 Q-values Right and Left”. This part is confusing: is the final output vector of dimension 2 or 50, and how is it divided into vectors responsible for the “Right” and “Left” actions? An example of a sample input and output feature dimension through the DQN + LSTM/FC setting would make this easy to comprehend, but is not provided.
      Since the input dimension is not given, it is important to mention the dimension of the downsampled features. In the DQN subsection, the two characteristics provided are similar, as the position of the agents defines the nearby clips that will be used for training; there is scope for further clarity in the text. The RMI initialization is not clear from the text. Is there another ResNet trained with transition indices as the final output? More details should be provided on RMI, as the reported results are better in the “RMI” setting than in the “FI” setting. The motivation behind the third Gaussian composition module is clear. The loss in Algorithm 1 is not explicitly specified, and there is no mention of the type of loss used. The data structure used for the replay memories is not specified: is it a tensor? What is the size of the replay memories? Is the sampling random or sequential? The authors refer to the RL paper [12], but the small details mentioned above should still be stated.
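For concreteness, the two-action (Left/Right) transition search could be sketched as follows. This is a hypothetical illustration of our reading of the method; the function name, reward shaping, and stride are assumptions, not the authors' code:

```python
# Hypothetical sketch of one step of a two-action (Left/Right) transition
# search. The reward shaping and stride are assumptions for illustration.

def step(position, action, true_transition, stride=1):
    """Move the agent's transition estimate one stride Left or Right.

    Returns (new_position, reward); the reward is positive when the move
    brings the estimate closer to the ground-truth transition index.
    """
    delta = -stride if action == "Left" else stride
    new_position = position + delta
    moved_closer = abs(new_position - true_transition) < abs(position - true_transition)
    reward = 1.0 if moved_closer else -1.0
    return new_position, reward
```

Under this reading, a DQN head only has to rank two Q-values per state, which would make a 50-dim FC layer mapped to 2 Q-values plausible; the paper should state this explicitly.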

    8. Standalone figures and tables: The problem statement in Fig 1 is clear and summarizes the goal of the paper. The architectural diagram in Fig 2 is clear and legible but missing small details like an arrow pointing from DQN to its expanded view.

    9. Reproducibility of the experiments: The basic hyperparameters are presented. Is there no weight decay used during training? Are both ResNet-50 and DQN trained end to end? This is not clear from the experimental setup section. The maximum number of steps for agent exploration is mentioned as 200. Is it because the network converges by 200 steps? Is there a lower and upper bound on the number of steps where convergence starts or stops? The manuscript should mention the maximum number of episodes used for training which is missing from the text.

    10. Data contribution/usage: The method is implemented/evaluated on a publicly available dataset - Cholec80. The paper uses the recommended train/val/test splits in the original dataset paper. A private in-house sacrocolpopexy dataset is used for evaluation however it’s focused on only one phase transition compared to multiple phase transitions in Cholec80.

    11. Results presentation: The results in the tables are clear, but the best results should be highlighted in bold. The results are specified with mean and std, which makes them easy to compare with other works. The manuscript stresses the improvement in performance on the Event and Ward Event ratio metrics, which is promising, but does not further discuss the case of Sacrocolpopexy, where TeCNO/Trans-SVNet, despite having a much higher F1-score than TRN, do not have a better Event/Ward Event ratio. The manuscript mentions TRN21/41 FI for Sacrocolpopexy in the results and discussion, but results are provided only for TRN21/81 FI. To maintain uniformity, the manuscript should also have presented results for the RMI setting on Sacrocolpopexy, which is missing with no rationale provided. One of the SOTA methods, Trans-SVNet, is said to be reproduced in this paper for comparison, but the reported numbers on the Precision and Recall metrics for Cholec80 are ~8-9% lower than the published performance. This raises questions about the quality of the baseline used and reported in the paper.

    12. Discussion of results and method justification: The results for Sacrocolpopexy are not “slightly” under the baselines; the wording used in the manuscript is misleading, as the difference between TeCNO/Trans-SVNet and TRN is ~10% on the F1-score/Precision/Recall metrics. The improvement in performance on Cholec80 is clearly described and the reasoning provided. However, it is important to know which part of TRN is largely responsible for smooth transitions: is it the DQN or the Gaussian composition? For example, the caption of Table 1 is confusing: the results for individual phases do not use Gaussian composition, but the overall F1-score is reported after applying Gaussian composition. What is the reason behind this?
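For reference, one way a Gaussian composition could turn predicted (begin, end) pairs into contiguous per-frame labels is sketched below. The parameterisation (midpoint-centred Gaussians with width proportional to phase length) is our guess for illustration, not the paper's exact formulation:

```python
import numpy as np

# Illustrative guess at a Gaussian composition: each phase scores every frame
# with a Gaussian centred on the phase midpoint, and each frame takes the
# argmax phase. Yields contiguous blocks when phase intervals do not nest.
def compose_labels(transitions, n_frames, sigma_scale=0.25):
    t = np.arange(n_frames)
    scores = np.zeros((len(transitions), n_frames))
    for p, (begin, end) in enumerate(transitions):
        mu = 0.5 * (begin + end)                        # phase midpoint
        sigma = max(sigma_scale * (end - begin), 1e-6)  # width ~ phase length
        scores[p] = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return scores.argmax(axis=0)
```

Note that with this kind of argmax composition, a phase predicted strictly inside another would split the outer phase into two segments, so an explicit description of the composition in the paper would help resolve the ablation question raised here.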

    13. Clinical relevance of the proposed method and obtained results: The clinical relevance of the work is not discussed.

    14. Conclusion: The paper presents significant insights for research continuation. The references are adequate, but ResNet and LSTM are not cited.

    15. Arguable claims: The videos are center-cropped, which might miss surgical activities or motion patterns happening near the frame borders. Most works resize the video frame rather than center-cropping it, and the performance might also be stunted by the cropping (p. 5). The manuscript should provide the reasoning behind this choice.

    16. Manuscript writing and typographical corrections:
      • “We implemented the standard DQN training framework for our netwrok” → “network” (p. 3)
      • “We perform a Gaussian composition of of the predicted phases” → remove the duplicated “of” (p. 4)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is a clear contribution which is relevant to the community. The results were compared with the relevant SOTA models (TeCNO, Trans-SVNet) in the domain. The release of the in-house dataset would add value to the research community

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The rebuttal feedback provides more clarity on the work done. These new details were in line with my prior assumptions and confirm that my initial concerns and review were justified. Hence, I do not change my initial rating of the manuscript.

    The authors’ explanation for using DQN due to the limited action space is valid; however, there might be better alternatives (e.g. A3C) that can minimize erroneous phase transitions. This makes it important to mention the effect of other RL-based methods on identifying phase transitions.

    Also, I agree that the RL-based method supports a low coverage rate (fewer frames needed for phase segmentation), making it useful in scenarios where data is sparse.

    I agree with the reasoning provided for the use of one phase in the Sacrocolpopexy dataset, as some phases might be more relevant to capture than others (e.g. dissection in Cholec80).

    The experimental settings provided in the rebuttal are now detailed and should be incorporated into the manuscript if accepted. The details on the loss, the shapes of the prediction logits from the LSTM to the actions, and the DQN-specific modeling have been made clearer.

    I did not observe any response to the missing citations for ResNet and LSTM. It is unacademic to intentionally omit relevant citations.

    On a general note, surgical phase recognition is a widely researched area with almost saturating recognition performance in recent works. The use of reinforcement learning for this task is lacking in the literature and adds to a vast list of unexplored methodologies and their unexplored strengths. While this early research might be lacking in details, quality, and performance, such a baseline is needed as a bridge to richer analysis in the future. In short, this work adds value to the research community.



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel method for offline surgical phase recognition using reinforcement learning. While most work views phase recognition as a frame-wise classification task, the authors rather define the task of finding the start and end point of each phase. This way, predicted phases are supposedly guaranteed to be contiguous. Two different initialization strategies are proposed which either use all (RMI) or only a subset of the video frames (FI). Methods are compared to SOTA online methods on 2 different datasets. The RMI-variant achieves superior performance on the cholec80 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Reformulating offline phase recognition as the task of finding phase transitions is a novel approach and more closely resembles how humans would solve this task.
    • The proposed task/method has the advantage that it (supposedly) produces contiguous phases.
    • The authors discuss limitations of the approach.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors propose an offline method but only compare to online methods
    2. There are several open questions regarding the method design which indicate limitations of the proposed task formulation. These open questions are mostly related to how the model behaves in edge cases (phase does not occur, phase is predicted inside another, average frame index is out of range, constraint “f_nb < f_ne”).
    3. The way windows are traversed by the agent might make the task unnecessarily difficult.
    4. Why was DQN chosen since there are many newer RL methods?
    5. It is not clear how the modification of the baselines affects performance.

    The weaknesses and possible solutions are discussed in more detail in section 8.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    For the most part, the method and training procedure is described in detail. The authors also indicate that they intend to publish their code.

    Some details could be clarified:

    • How many epochs and episodes were used for the ResNet backbone and the DQN respectively.
    • The authors say they used a validation set for model selection. Which metric was used here?
    • There are some open questions regarding the design of the method which are elaborated in more detail in section 8 (main weakness 2)
    • How exactly are the metrics computed? E.g. how are NaN values in precision and recall handled if a phase is not predicted or does not occur? Were the metrics with relaxed boundaries used like in Trans-SVNet and other previous work (e.g. TMRNet, MTRCNet-CL, SV-RCNet)? Other metrics are fine but it should be made clear how they were computed.
    • Were experiments repeated? This is not clear since the standard deviation is computed over videos.
    • If they were repeated, how were scores computed? Are predictions first averaged to compute one score or are scores computed for each prediction and then averaged?
    • For the RMI approach, the authors state that the indices of all possible transitions are averaged to initialize the agents. How are “possible transitions” defined?
    • The authors state they used window sizes of 21 and 41. Is this L? Or is it the complete receptive field (i.e. window size of 21 = 2*L+1 with two search windows of L=10 plus the center frame)? Section 2.1, however, states that the state of each agent consists of 2L features, not 2L+1.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    MAIN WEAKNESSES

    1. Offline method only compared to online methods
      • The authors propose an offline approach but only compare their results with online methods. Since offline phase recognition is a considerably easier task than its online counterpart, it is not clear if the performance gain is because of the effectiveness of the proposed RL method or simply due to the easier task.
      • TeCNO and Trans-SVNet should be fairly straightforward to reimplement as offline methods (especially TeCNO).
      • SUGGESTION: I believe the authors should compare with offline methods (e.g. offline TeCNO, offline Trans-SVNet) to demonstrate whether the proposed RL formulation is competitive with the standard frame-classification formulation.
    2. There are several open questions regarding the method design. These questions partially indicate limitations of the proposed task formulation:
      • The paper does not mention what happens if a phase does not occur in a video. If the model cannot handle this case and always predicts all phases, this would be a major limitation.
      • The authors state that their method guarantees contiguous phases. However, what happens if the start and end points of one phase are predicted to lie within another phase (e.g. f_1b < f_2b < f_2e < f_1e)? Due to the Gaussian composition, this would result in the ‘outer’ phase being split into two segments. How is this case handled? Has this happened in any video?
      • Is the constraint “f_nb < f_ne” somehow enforced by the model? How would the model behave if this constraint was violated? Has this ever happened?
      • For the FI approach, the authors state that transitions are initialized at the “average frame index”. For short surgeries and late phases, this average frame index likely often lies outside the range of the video. How are these cases handled? Or is a relative frame index used instead (i.e. the average progress of the surgery in percent)?
      • IDEA: Allowing the agents to predict “f_nb > f_ne” might be a way of handling missing phases and might kill two birds with one stone. Not sure if this is a good idea.
    3. The way windows are traversed by the agent might make the task unnecessarily difficult.
      • How are the windows traversed by the LSTM agents? If they are traversed sequentially, then the most relevant frames are likely somewhere in the middle of that sequence. The LSTM might forget relevant information or it might be difficult to remember their exact location.
      • E.g. if the agent is currently at the correct location, the LSTM would have to remember that the transition happened at exactly the middle of the sequence. This seems like an unnecessarily difficult task. Adding a positional encoding or traversing the sequence from outside to inside might be alternative strategies.
      • Why was this sequential strategy (supposedly) chosen? Did the authors test or consider other traversing/encoding strategies?
    4. Why was DQN chosen?
      • The standard DQN algorithm is quite old, and many improved or different RL approaches already exist. Why did the authors not opt for more modern RL methods like PPO [1], SAC [2], A3C [3], or HER [4]?
      • SUGGESTION: The authors should explain why they chose DQN or consider a more modern RL method.
    5. The authors modify the baseline methods but it is not clear if this modification improved or hurt performance.
      • Modifying the baselines to be more comparable to the proposed approach is definitely a valid approach.
      • Nevertheless, the original approach should still be reported to understand how this modification affected performance.

    [1] https://arxiv.org/abs/1707.06347 [2] https://arxiv.org/abs/1801.01290 [3] https://arxiv.org/abs/1602.01783 [4] https://arxiv.org/abs/1707.01495

    REQUIRED CLARIFICATIONS

    • The open questions from the reproducibility section and ‘main weakness 2’ could be clarified.

    MINOR COMMENTS

    • Apparently a mistake happened in the supplementary material. All plots show the results of the same video.
    • Why was the RMI approach not evaluated on the sacrocolpopexy dataset?
    • It is quite cumbersome to work out how big the receptive field of the model is in terms of seconds. If I understand correctly, it would be 2·L·16 / 2.4 seconds (with a window size of 2·L feature vectors, the averaged ResNet producing 1 feature vector from 16 frames, and an initial framerate of 2.4 fps). Maybe this could be made clearer in the paper? Or, if I am incorrect, the correct receptive field could be given.
    • Typo: “netwrok” in Section 2.1
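The receptive-field arithmetic questioned above can be checked directly. This is a hypothetical helper expressing the reviewer's reading (2·L feature vectors, 16 frames per feature, 2.4 fps), not a formula from the paper:

```python
# Receptive field in seconds under the reviewer's reading of the paper:
# 2*L feature vectors, each averaging 16 frames, at a framerate of 2.4 fps.
def receptive_field_seconds(L, frames_per_feature=16, fps=2.4):
    return 2 * L * frames_per_feature / fps

# e.g. L = 10 (window size 21 = 2*L + 1) gives 320 / 2.4 ≈ 133.3 seconds
```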
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present an interesting new task formulation for offline phase recognition. However, while the idea is promising, there are many open questions regarding the method’s design which indicate poor behavior in edge cases. E.g. it does not seem like the model can handle missing phases. Another weakness is that the results are only compared to online approaches (which is a considerably harder task). It is not clear if the proposed RL formulation could compete with the offline variants of standard phase recognition models (e.g. TeCNO or Trans-SVNet).

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors provided a very good and precise rebuttal. I am changing my score from “weak reject” to “weak accept”. Due to some limitations in evaluation and method design I would also understand a rejection, but overall I think it is an interesting paper from which the community could benefit.

    Most of my concerns were well addressed:

    a. Baseline methods (Trans-SVNet, TeCNO) were modified (weakness 5 in my review): This might not be a perfect SOTA comparison, but I understand the argument of having better comparability. The paper proposed an interesting new task formulation that is very different from previous approaches. I believe it is not necessary to immediately achieve SOTA scores in every setting, especially since years of research have been used to finetune these frame-classification approaches while the RL approach is new and could be pushed towards SOTA in future work.

    (Addition: As other reviewers mentioned, the RMI experiment should be added for the Sacro dataset. In my opinion, even inferior scores would be acceptable for the reasons above. It should be noted that the authors chose to include the additional dataset despite its lower scores; this is good practice and should not be discouraged.)

    b. Offline vs online (w1): This was a misunderstanding, since authors implemented offline variants of Trans-SVNet and TeCNO. However, unless I missed it, this was not mentioned in the paper. The authors should clearly indicate that offline variants were used.

    c. Method design and edge cases (w2): Addressed well in the rebuttal but should be described clearer in the paper.

    • “f_nb < f_ne”: Resolved but should be mentioned.
    • “Avg. frame out of range”: same
    • “Phase inside another”: This limitation is not critical. However, the authors state that their method guarantees contiguous phases which is not entirely true. This claim should be reformulated.
    • “Missing phases”: Remains but has little effect on performance. It would suffice to add this to the limitations section.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a novel framework of using reinforcement learning for identifying transitions between phases, for the task of surgical workflow/phase segmentation. The approach has been validated against SOTA on the Cholec80 dataset and an in-house dataset of laparoscopic sacrocolpopexy.

    The main criticisms of the work relate to the experimental validation, the method design, and missing details in the experimental setup. The following points should be addressed in the rebuttal:

    • Justification for the experimental validation and results (including the comparison of the offline approach to online methods, the performance comparison under partial coverage, and the computational efficiency of the proposed method)
    • Limitations of the proposed approach/methodology and behavior under edge cases
    • Missing details in the experiments (including details related to reproducibility, and traversing/encoding strategies)
    • Justification for choice of methodology (DQN)
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

Offline vs online methods
TeCNO and Trans-SVNet were run in offline mode by disabling the causal convolution in the TCN and providing all frames as input; their code has a flag for this.

Coverage rate
This is not a controllable parameter but a performance metric (lower is better; full coverage = 1 is the worst case). It is the average percentage of frames needed to produce a complete workflow segmentation. Frame-based methods always need 100% of the frames, but TRN FI avoids this, which is a unique advantage of RL and a core contribution.
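As we read this definition, the metric for a single video could be computed as simply as the following hypothetical helper (not from the paper):

```python
def coverage_rate(visited_frames, n_frames):
    """Fraction of distinct frames read to produce the segmentation.

    1.0 means every frame was processed (as in frame-based methods);
    lower is better. Hypothetical helper illustrating the metric.
    """
    return len(set(visited_frames)) / n_frames
```

A frame-based method visits every frame and scores 1.0; a TRN FI run that only reads the frames its agents traverse scores lower.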

Choice of DQN
We have a discrete action space with only 2 actions. This is a very simple exploration space by RL standards, so a lightweight DQN is sufficient. More modern methods would be better suited to simultaneous estimation of all phases (see future work in the conclusions), due to the higher-dimensional action space and more complex exploration.
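The two-action discrete setup pairs naturally with standard epsilon-greedy action selection. The sketch below is a generic illustration of that selection step, not the authors' code; the Q-value input is a stand-in for the DQN head's output:

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy selection over a 2-action discrete space.

    q_values: sequence of 2 estimated action values (e.g. a DQN head output).
    With probability epsilon, explore uniformly; otherwise exploit the argmax.
    """
    assert len(q_values) == 2
    if random.random() < epsilon:
        return random.randrange(2)
    return 0 if q_values[0] >= q_values[1] else 1
```

With only two actions, even naive uniform exploration covers the action space quickly, which is consistent with the claim that a lightweight DQN suffices here.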

Limitations & edge cases
Missing phases: This happens in Cholec80 test videos and has little impact. The RL agent makes the begin and end labels converge towards the same or consecutive timestamps. Sometimes residual frames are still erroneously predicted as the missing phase; these errors are counted in the reported statistics and are the reason we do not always achieve a perfect event ratio.
f_nb > f_ne: Actions that would make f_nb > f_ne are not allowed in our RL environment, so this does not happen.
Avg. frame out of range: Transitions are initialized as a percentage of video duration, not as an absolute index, so this does not happen.
Phase inside another phase: We agree this can happen in theory; it never happened in the Cholec80 test videos.

TRN below baseline on Sacro dataset
Only in 1 out of 3 metrics. TRN is better in event ratio (no noisy transitions) and has a significantly lower coverage rate (and is thus faster).

Inference time in avg. seconds per video
Cholec80: ResNet50 96.6, TeCNO 99.6, Trans-SVN 99.6, TRN21 FI 60.6, TRN41 FI 64.9, TRN41 RMI 105.5
Sacro: ResNet50 493.7, TeCNO 493.8, Trans-SVN 493.9, TRN21 FI 78.1, TRN81 FI 104.0

Trans-SV scores lower than original paper
Note that the TeCNO paper also reports different scores from the Trans-SVNet paper (especially the standard deviations), possibly due to different video subsampling or different ResNet weights, so there is no unified methodology to copy. For fairness, all methods are run with the same subsampling (2.4 fps), cropping/resizing, and ResNet weights. Changing these could improve the results, but the change would apply equally to all methods, so the relative comparisons remain valid.

Why only 1 phase in Sacro dataset:
It is of clinical interest. Studies claim that replacing suturing with other techniques is faster and easier to learn (e.g. Lambin et al., "Glue mesh fixation in laparoscopic sacrocolpopexy: results at 3 years' follow-up"). Detecting this single phase enables large-scale measurements for such studies. It is also an example of the best-case scenario for FI in terms of computational gains, which adds insights not visible in Cholec80.

No RMI for Sacro
Sacro is more interesting for FI (a single phase in long videos). It maximises the computational gains (very low coverage rate), so we focus the analysis on this configuration.

LSTM traversing
Sequential. Forgetting is not a major problem given the short sequences (21 or 41 frames). The re-ordering suggestions are interesting.

More details:
Max number of DQN agent steps: 200. No early stopping.
The same ResNet is used for RMI, the search window, TeCNO, and Trans-SVNet.
RMI methodology: classify all frames with the ResNet, extract all begin and end transitions for each phase, and average the timestamps.
DQN sizes: the reshaped LSTM output is 20L-dimensional; FC1 maps 20L -> 50; FC2 maps 50 -> 2.
Algorithm 1: Huber loss.
Replay memory: 10000 items per agent; each item contains 4 tensors (state, action, next state, reward), sampled randomly with batch size 128 during training.
Weight decay: no.
ResNet and DQN are trained separately. ResNet-50: 100 epochs; DQN: 20 episodes.
TRN FI 41/81: we use 81 on Sacro and 41 on Cholec80.
Window size L: full window length. The state has 2 windows (beginning and end), thus 2L.
Repeated experiments: no.
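The replay memory described here (10000 items per agent, each a (state, action, next state, reward) tuple, sampled uniformly with batch size 128) follows the standard DQN experience-replay pattern. A minimal stdlib sketch, with illustrative names not taken from the paper's code:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience buffer with uniform random sampling."""

    def __init__(self, capacity=10000):
        # deque with maxlen discards the oldest items automatically
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        self.buffer.append((state, action, next_state, reward))

    def sample(self, batch_size=128):
        # uniform sampling without replacement from the stored transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Uniform sampling from a fixed-size buffer decorrelates consecutive transitions, which is the usual motivation for experience replay in DQN training.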




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a novel approach to the problem of surgical phase recognition/segmentation through reinforcement learning, making a relevant contribution to this widely researched field in the CAI community. The main concerns of the reviewers around the validation strategy and baseline methods as well as method design have been addressed by the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments. Although there are still weaknesses in this work, the reviewers agree that there is still merit and it proposes an interesting research direction.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a novel direction of retrieval of surgical phase transitions using reinforcement learning. The authors addressed the major issues from the reviewers during the rebuttal. Although the current version of the paper still has weaknesses, two reviewers agree that this paper has merits for the CAI community. I think this paper has more pros than cons. Therefore I would also vote for the acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



back to top