
Authors

Luoying Hao, Yan Hu, Wenjun Lin, Qun Wang, Heng Li, Huazhu Fu, Jinming Duan, Jiang Liu

Abstract

Recognition and localization of detailed surgical actions are essential components of a context-aware decision support system. However, most existing detection algorithms fail to provide high-accuracy action classes, even given their locations, because they do not consider the regularity of the surgical procedure across the whole video. This limitation hinders their application. Moreover, deploying such predictions in clinical applications requires conveying model confidence to earn trust, which remains unexplored in surgical action prediction. In this paper, to accurately detect the fine-grained actions that happen at every moment, we propose an anchor-context action detection network (ACTNet), comprising an anchor-context detection (ACD) module and a class conditional diffusion (CCD) module, to answer the following questions: 1) where the actions happen; 2) what the actions are; 3) how confident the predictions are. Specifically, the proposed ACD module spatially and temporally highlights the regions that interact with the extracted anchor in the surgical video, and outputs the action location and its class distribution based on anchor-context interactions. Considering the full distribution of action classes in videos, the CCD module adopts a denoising diffusion-based generative model, conditioned on our ACD estimator, to further reconstruct accurate action predictions. Moreover, we exploit the stochastic nature of the diffusion model outputs to assess model confidence for each prediction. Our method reports state-of-the-art performance, with an improvement of 4.0% mAP over the baseline on the surgical video dataset.
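To make the two-module flow described above concrete, here is a minimal toy sketch: an ACD-style estimator producing an action location and an initial class distribution, followed by a CCD-style stochastic refinement sampled N times so that the prediction-interval width can serve as a per-prediction confidence signal. All function names, shapes, and numbers are hypothetical illustrations under simplified assumptions (the real CCD is a learned denoising diffusion model); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 5

def acd_estimate(frame_feat, anchor_feat):
    """Hypothetical stand-in for the ACD module: score anchor-context
    interactions and return an action box plus an initial class distribution."""
    logits = frame_feat @ anchor_feat                 # toy interaction scores, one per class
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    box = np.array([0.2, 0.3, 0.6, 0.7])              # placeholder location (x1, y1, x2, y2)
    return box, probs

def ccd_sample(cond_probs, steps=50, noise=0.05):
    """Toy stand-in for one reverse-diffusion pass conditioned on the ACD
    estimate: start from noise and drift toward the conditioning
    distribution, keeping stochasticity so repeated samples differ."""
    x = rng.normal(size=cond_probs.shape)
    for _ in range(steps):
        x += 0.1 * (cond_probs - x) + noise * rng.normal(size=x.shape)
    x = np.clip(x, 1e-6, None)
    return x / x.sum()

# Toy inference: refine the ACD class distribution with the CCD sampler and
# use the spread of N stochastic samples (prediction-interval width) as a
# per-prediction confidence signal, as the abstract describes.
frame_feat = rng.normal(size=(NUM_CLASSES, 8))        # stand-in frame features
anchor_feat = rng.normal(size=8)                      # stand-in instrument-anchor feature
box, init_probs = acd_estimate(frame_feat, anchor_feat)

samples = np.stack([ccd_sample(init_probs) for _ in range(30)])   # N = 30 samples
mean_probs = samples.mean(axis=0)
width = np.quantile(samples, 0.975, axis=0) - np.quantile(samples, 0.025, axis=0)
k = int(mean_probs.argmax())
print(f"class={k}  p={mean_probs[k]:.2f}  95% interval width={width[k]:.2f}")
```

A narrow interval across repeated samples indicates the model is consistent about a prediction; a wide interval flags it as uncertain.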

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_19

SharedIt: https://rdcu.be/dnwOT

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

This paper presents a framework for surgical action recognition in surgical videos. The framework combines both spatial and temporal information for action detection. In particular, an anchor-context detection module is employed to relate the detections to the anchors, and a class conditional diffusion model is adopted to incorporate prior knowledge about actions. The proposed framework was tested on a newly acquired cataract video dataset, and the results show that it outperforms other state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The application addressed in this paper fits the interests of the MICCAI community. The technical novelty of this paper is sufficient. Using both ACD and CCD in the framework has demonstrated better performance than the other compared approaches.

    A new cataract dataset has been used in the paper. A range of state-of-the-art approaches have been included in the comparison. The evaluation of the approaches was conducted rigorously.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

This reviewer is able to understand the modules/approaches employed in this paper. However, I still find the writing quality insufficient: several passages use overly complicated sentences that are not concise. The figures do help convey the methodology in several places, and the supplementary material is also helpful.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

As indicated, the code and dataset used in this paper will not be provided, so the results cannot be independently verified as reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. I would suggest the authors re-write the abstract. It is currently hard to read. For example, the second sentence, “However, most existing detection … which hinders application”, should be rephrased. Also change “computer-assistant” to “computer-assisted” in the Introduction.

2. Consider also changing the section title to Methods/Methodology, which would be more aligned with the structure of the paper.

3. Under the description of STAB: can the authors be clearer about how the set Sj is retrieved through sampling? How is the factor C(ft) chosen?

4. The authors also mention STAM in Table 1 and in the ablation study. Is STAM the same as STAB? If so, consider using the same term throughout.

5. The dataset used in this paper contains 15 videos for training and 5 videos for testing. Why do the authors not have a validation set? Can the authors provide a clarification?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The technical novelty of this paper is good. The experimental section is well presented. The experiments were performed in a rigorous manner, and a range of state-of-the-art methods were included in the comparison study. The results demonstrate that the proposed framework outperforms the others.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

This paper presents a new model architecture for surgical action detection (classifying and temporally localizing activities in videos). The effectiveness of the approach has been validated on a cataract surgical video dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Surgical action detection is a crucial building-block technology for creating next-generation context-aware surgical systems; therefore, the paper is highly relevant to the community.

    The proposed method is validated on a clinical dataset and proven to be effective. The model architecture is well described and correct methods and metrics have been used for its validation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Video action detection is a well-known task in the general computer vision and surgical data-science communities. The objective is to both classify and localize predefined actions in long videos. State-of-the-art algorithms in this space are based on a transformer backbone (Swin Transformer, etc.) with an RNN (GRU, etc.) for learning temporal action sequences. While this paper proposes a new architecture, it does not compare its results with such state-of-the-art approaches, which is an important weakness. In addition, there are a few publicly available datasets that could be used to benchmark the proposed method against techniques previously known to the community. Without such validation studies, the impact of this work is limited.

Details on how to train this model are missing from the paper. Is the model trained end to end, or are there multiple steps involved for the spatial part and the STAB?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    At the moment, the paper is not reproducible since the dataset is not public and the code is not available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Video action detection is an important task, and multiple groups are working on it. To highlight the impact of your work, you need to benchmark it on known publicly available datasets and compare performance with methods already developed by other teams.

    Please add a few lines to the paper describing how the model can be trained and how many steps are involved. This will improve reproducibility.

Please add a few lines to the paper describing important failure situations and challenges for its generalizability.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method is novel and the paper is well written. The authors need to compare with state-of-the-art action detection methods and benchmark on at least one publicly available dataset.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

In this paper, the authors propose an anchor-context action detection network (ACTNet) that accurately detects fine-grained surgical actions occurring at each moment. The proposed approach includes an anchor-context detection module and a class conditional diffusion module to answer where the actions happen, what the actions are, and how confident the predictions are. The authors claim that their method achieves state-of-the-art performance, improving mAP by 4.0% over the baseline on the surgical video dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) The proposed approach, which includes an anchor-context detection module and a class conditional diffusion module, provides a reliable surgical action detection method with accurate action predictions and their confidence. (2) The paper provides a detailed explanation of the proposed method, including a novel spatio-temporal anchor interaction block (STAB) that spatially and temporally highlights the context related to the extracted anchor. (3) The authors carry out comparison and ablation experiments to demonstrate the effectiveness of their proposed algorithm on cataract surgery videos.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The paper lacks details on the experimental setup and evaluation of the proposed method, such as the number of surgical videos used in the experiments and the criteria for evaluating the performance of the model. (2) The authors only compare their proposed method with the baseline; they do not compare it with any other state-of-the-art methods or discuss the limitations of the proposed approach. (3) The paper lacks clarity in explaining how the proposed conditional diffusion-based generative model reconstructs the class distribution to provide accurate estimations and assess the model’s confidence.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

May not be reproducible, as the authors didn’t release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Questions To Authors And Suggestions For Rebuttal: (1) Can the authors provide more details on the experimental setup and evaluation of the proposed method, such as the number of surgical videos used in the experiments and the criteria for evaluating the performance of the model? (2) Why did the authors compare their proposed method only with the baseline, rather than with other state-of-the-art methods, and why are the limitations of the proposed approach not discussed? (3) Can the authors provide a more detailed explanation of how the proposed conditional diffusion-based generative model reconstructs the class distribution to provide accurate estimations and assess the model’s confidence?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach for anchor-context action detection in surgery videos. While the proposed method provides accurate action predictions and confidence, the paper lacks details on the experimental setup and evaluation, does not compare it with any other state-of-the-art methods, and lacks clarity in explaining the proposed generative model.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors present their work on an anchor-context action detection network (ACTNet) that seeks to leverage surgical video of cataract surgery to create action triplets (similar to work done in cholecystectomy, though with a unique approach).

Strengths: 1) The paper presents new work on anchor context for cataract surgeries and includes not just the action triplet prediction but also the confidence behind each prediction. 2) There is novelty in their approach, including in the discussion of the spatio-temporal action block. 3) The authors’ ablation studies help provide a more complete understanding of the contributions of the different elements of their method.

Weaknesses of this paper are somewhat outweighed by the strengths: 1) Additional information on model training is important to ensure reproducibility and to more fully understand the steps required to achieve the reported results; further methodological details are important, as highlighted by reviewers #1 and #3. 2) In Table 1, the authors report an ablation that mentions “STAM”, though this is the only use of the term I can find in the paper. Do the authors intend to refer to STAB, which is described in the paper? 3) A separate point: the clarity of the paper, particularly the abstract and elements of the method and reporting, can be improved with additional grammatical review to correct typographical errors.




Author Feedback

We thank the reviewers for their high-quality comments. Below we provide point-to-point responses, which will be integrated into the final version.

[Q] Experimental setup. (Meta/R1)
[A] A training set with 15 videos and a testing set with 5 videos were used in the experiments. Performance is evaluated with the official metric, frame-level mean average precision (mAP), at IoU = 0.1, 0.3, and 0.5. Due to the page limit, more detailed information is listed in the supplementary material.

[Q] Is STAM the same as STAB? (Meta/R2)
[A] Yes, STAM is the same as STAB. We have corrected it.

[Q] Rephrase the abstract. (Meta/R2)
[A] We have rephrased the sentence in the abstract as: “However, most existing detection algorithms fail to provide high-accuracy action classes, even given their locations, because they do not consider the regularity of the surgical procedure across the whole video. This limitation hinders their application.”

[Q] Performance comparison. (R1/R3)
[A] We compare our method with the state-of-the-art method MViTv2 [1], which was published in 2022 and is also applied to action detection. For comparative fairness, we use the same object detector to detect instruments. MViTv2 achieves mAP10 = 0.394, mAP30 = 0.387, mAP50 = 0.382, and mAPmean = 0.388; our method outperforms it. Compared with existing methods, we consider the spatio-temporal interactions with instruments in surgery and provide the uncertainty of predictions, which is significant in surgical scenes. As for datasets, so far we have not found any publicly available surgical video dataset that includes labels for fine-grained actions together with the locations of those actions.
[1] Li, Yanghao, et al. “MViTv2: Improved multiscale vision transformers for classification and detection.” CVPR 2022.

[Q] More details of the diffusion model. (R1)
[A] For our conditional diffusion-based generative model, we first define a forward diffusion that transforms the data into noise, and then a reverse diffusion that regenerates the data from noise. Conditioned on the action classes and the surgical videos, the model learns the distribution of action classes well enough to recover accurate results. Moreover, because it produces stochastic outputs, the generative model is a preferable choice for evaluating the uncertainty of predictions: we can sample N times and use the width of the prediction interval to indicate it.

[Q] Description of Sj and C(ft)? (R2)
[A] j is the index that enumerates all possible positions of the frame features f_t, and S_j represents the set of all positions j. The response is normalized by the factor C(f_t) = Σ_{j∈S_j} h(f_t^j, i_t) (a toy illustration appears after these responses).

[Q] No validation set. (R2)
[A] Splitting our dataset into three sets (training, validation, and testing) would make each set too small, so we evaluate our method using the last epoch.

[Q] Training process. (R3)
[A] Our implementation proceeds in three steps. First, the object detector extracts high-score instrument detections as anchors. Then, STAB is trained to produce the initial action class distribution. Finally, after refinement by CCD, we obtain the final, more accurate action predictions (action locations and their classes) together with their confidence. We will add a more concise description to the revised paper.
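As a concrete illustration of the Sj / C(ft) answer above, the following minimal numpy sketch applies that normalization in a non-local-attention style. The exponentiated dot-product form of h is our assumption (as in non-local networks); the rebuttal only specifies the normalizing sum.

```python
import numpy as np

def anchor_context_attention(frame_feats, anchor_feat):
    """Toy sketch of the normalization in the Sj / C(ft) answer: h scores
    every spatial position j of the frame features f_t against the
    instrument anchor i_t, and the responses are normalized by
    C(f_t) = sum_{j in S_j} h(f_t^j, i_t)."""
    scores = np.exp(frame_feats @ anchor_feat)   # h(f_t^j, i_t) for every position j
    C = scores.sum()                             # normalization factor C(f_t)
    weights = scores / C                         # attention over the position set S_j
    return weights @ frame_feats                 # anchor-aware context feature

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(49, 16))   # e.g. a flattened 7x7 grid of 16-d features
anchor_feat = rng.normal(size=16)         # instrument anchor feature i_t
print(anchor_context_attention(frame_feats, anchor_feat).shape)   # (16,)
```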
[Q] Failure situations. (R3)
[A] In our experiments, the model easily makes wrong predictions in situations such as instruments with similar appearance or actions with insignificant distinguishing characteristics. Future work should focus on better distinguishing similar actions. In addition, the model is currently trained and verified only on cataract videos, so there may be challenges in generalizing to other surgical scenes; larger medical model frameworks could be introduced in the future to improve generalization. We will add a more concise description to the revised paper.
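For reference, the frame-level mAP metric cited in the rebuttal rests on IoU-based matching of predicted boxes to ground truth at a threshold (0.1, 0.3, or 0.5), followed by average precision over the ranked detections. The sketch below is a generic, simplified version of that protocol (greedy matching, 11-point interpolated AP), not the authors' exact evaluator; real evaluators vary in detail.

```python
import numpy as np

def box_iou(a, b):
    """Standard intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, iou_thr):
    """dets: list of (score, box) for one class; gts: list of ground-truth
    boxes. Detections are ranked by score and greedily matched to unmatched
    ground truth at the given IoU threshold."""
    dets = sorted(dets, key=lambda d: -d[0])
    matched = [False] * len(gts)
    tp, fp = [], []
    for score, box in dets:
        ious = [box_iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr and not matched[best]:
            matched[best] = True
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1)
    # 11-point interpolated AP, kept deliberately simple
    return float(np.mean([precision[recall >= r].max() if (recall >= r).any() else 0.0
                          for r in np.linspace(0, 1, 11)]))

dets = [(0.9, (10, 10, 50, 50)), (0.6, (100, 100, 140, 150))]
gts = [(12, 8, 48, 52), (200, 200, 240, 240)]
print(average_precision(dets, gts, iou_thr=0.5))  # mAP then averages AP over classes
```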


