
Authors

Zhenqiang Li, Lin Gu, Weimin Wang, Ryosuke Nakamura, Yoichi Sato

Abstract

Automated video-based assessment of surgical skills is a promising task for assisting young surgical trainees, especially in low-resource areas. Existing works often resort to a joint CNN-LSTM framework that models long-term relationships with LSTMs on spatially pooled short-term CNN features. However, this practice inevitably neglects the differences among semantic concepts such as tools, tissues, and background in the spatial dimension, impeding subsequent temporal relationship modeling. In this paper, we propose a novel skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions. The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network’s decisions. It also enables us to further incorporate auxiliary information, such as kinematic data, to improve representation learning and performance. Experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods. Source code is available at: bit.ly/MICCAI2022ViSA.
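
A minimal sketch of the general grouping idea, assuming the semantic parts are found by softly assigning CNN feature-map locations to K learnable prototypes and pooling one descriptor per group. This is an illustration only, not the authors' implementation (that is in the linked repository); all names and sizes below are assumptions.

    import torch
    import torch.nn as nn

    class ToySemanticGrouping(nn.Module):
        """Toy sketch: softly assign each spatial CNN feature to one of K
        learnable prototypes (e.g. tools / tissue / background) and pool one
        descriptor per group. Illustrative only; not the paper's module."""

        def __init__(self, channels: int, num_groups: int = 3):
            super().__init__()
            self.prototypes = nn.Parameter(torch.randn(num_groups, channels))

        def forward(self, feats: torch.Tensor):
            # feats: (B, C, H, W) feature map from a CNN backbone
            B, C, H, W = feats.shape
            x = feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
            logits = x @ self.prototypes.t()            # (B, H*W, K)
            assign = logits.softmax(dim=-1)             # soft assignment maps
            # normalize per group, then take assignment-weighted mean features
            weights = assign / assign.sum(dim=1, keepdim=True).clamp(min=1e-6)
            group_feats = weights.transpose(1, 2) @ x   # (B, K, C)
            return group_feats, assign.reshape(B, H, W, -1)

    feats = torch.randn(2, 256, 14, 14)                 # fake backbone output
    group_feats, assign = ToySemanticGrouping(256)(feats)
    print(group_feats.shape, assign.shape)              # (2, 3, 256), (2, 14, 14, 3)

Each per-group descriptor can then be modeled over time separately, which is the kind of spatiotemporal aggregation the abstract describes.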

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_39

SharedIt: https://rdcu.be/cVRXe

Link to the code repository

bit.ly/MICCAI2022ViSA

Link to the dataset(s)

https://cirl.lcsr.jhu.edu/research/hmm/datasets/jigsaws_release/

https://endovissub-workflowandskill.grand-challenge.org/Data/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper describes a method for analysis of videos to assess surgical skill. The network architecture includes a semantic grouping module that uses clustering of local semantic features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method includes a data-driven approach to emphasize relevant semantic information for skill assessment. While the experiments illustrate instrument motion as the semantic information, the introduction claims that the method can isolate information beyond the instruments. The findings are expected and reiterate the current understanding of which information in video images is relevant for surgical skill assessment. The method is evaluated on two datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The evaluation metrics are not acceptable, even though the paper cites previous works that used correlation as a metric. Correlation is perhaps one of the less relevant measures when evaluating model predictions. Mean absolute error is acceptable, but not sufficient. I don’t mean to undermine prior work on the JIGSAWS dataset that is listed in Table 1, but the sub-par choice of evaluation metrics by the community was likely due to insufficient input from collaborators with statistical expertise. There are no measures of variance, which makes the claims made in the paper unclear to me. Many of the claimed improvements may be within what is expected from sampling variance.
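
    One standard way to obtain such measures of variance is a percentile bootstrap over the evaluation pairs. A minimal sketch for Spearman correlation and MAE follows; the scores are made up and purely illustrative.

        import numpy as np
        from scipy.stats import spearmanr

        def bootstrap_ci(y_true, y_pred, metric, n_boot=10000, alpha=0.05, seed=0):
            """Percentile-bootstrap confidence interval for metric(y_true, y_pred)."""
            rng = np.random.default_rng(seed)
            n = len(y_true)
            stats = []
            for _ in range(n_boot):
                idx = rng.integers(0, n, size=n)   # resample (truth, prediction) pairs
                stats.append(metric(y_true[idx], y_pred[idx]))
            lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
            return metric(y_true, y_pred), (lo, hi)

        # made-up skill scores and model predictions, for illustration only
        y_true = np.array([12.0, 18.0, 25.0, 9.0, 22.0, 15.0, 20.0, 11.0])
        y_pred = np.array([14.0, 17.0, 23.0, 10.0, 21.0, 13.0, 24.0, 12.0])

        rho, rho_ci = bootstrap_ci(y_true, y_pred, lambda t, p: spearmanr(t, p)[0])
        mae, mae_ci = bootstrap_ci(y_true, y_pred, lambda t, p: np.abs(t - p).mean())
        print(f"Spearman rho = {rho:.3f}, 95% CI = ({rho_ci[0]:.3f}, {rho_ci[1]:.3f})")
        print(f"MAE          = {mae:.3f}, 95% CI = ({mae_ci[0]:.3f}, {mae_ci[1]:.3f})")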

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Measures of variance are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    i) What is the architecture of the transformer for which findings are reported in Table 3? ii) What do we learn about the semantic groups in the HeiChole dataset? Do they emphasize only the instruments? How consistent are the groups based on visual examination?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Measures of performance are insufficient; measures of variance of the estimates are missing. The method is interesting, and it is a variant of existing methods that illustrates the relevance of known information.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    6

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The method is interesting, but the evaluation doesn’t make the cut. Reliance on previous art is not an adequate way to address limitations in the present work. However, the authors do not claim statistical superiority of the method, and instead describe it as competitive. Because the reviews will be released, I’m hopeful that the community will consider appropriate statistical methods, with input from those with expertise in evaluating proof of concept prediction models. One note I have for the authors is that there are typos, e.g., JIGSAWS spelt incorrectly, and the tables have acronyms that are not explained in the footnote, which makes it hard for the reader because they have to search for the explanations in the text.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new framework called ViSA that predicts the skill level shown in surgical videos by discovering and aggregating different semantic parts. The framework has been compared to previous work and achieves competitive performance on two datasets: JIGSAWS and HeiChole.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of this work is interesting. Without supervision, it can find different semantic parts in surgical videos, and the visualization results confirm this. This is helpful for discovering structure in complicated surgical scenes.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The writing needs improvement. For example, Fig. 3 in Sec. 3.3 mentions that “SGM facilitates the concentration on the task-related regions such as tools and discarding the unrelated background regions.” However, it is not clear how this conclusion was obtained. Does it mean that the authors conducted a comparison experiment applying Grad-CAM with and without SGM?

    (2) As for temporal context modeling, have the authors considered other models besides LSTM and BiLSTM?

    (3) As for the number K: besides the ablation study on K=2,3,4, is there any further analysis or consideration behind the choice of K? For example, does it depend on the scene complexity of the surgical videos, or on the surgery type?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper looks fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall this is an interesting paper. Both the idea and the experimental results are attractive. However, there are still concerns about the writing and the analysis of the experimental results.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel framework called ViSA which addresses the problem of automated skill assessment in surgical videos. The authors state that state-of-the-art models often do not capture semantic information, as they typically employ CNNs for short-term feature extraction and temporal aggregation networks (e.g., LSTMs) for long-term relationship modeling. They claim that global pooling of CNN features over the spatial dimension ignores the semantic variance of different features. They propose to instead discover and aggregate different semantic parts of the surgical setting across spatiotemporal dimensions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors’ motivation to address the shortcoming of state-of-the-art models by discovering and aggregating different semantic parts of the surgical setting across spatiotemporal dimensions is an interesting approach.

    As the authors cluster spatial features, no supervision is required to discover these semantic features which are expected to belong to different surgical elements such as tools, the tissue, and the background. However, a supervised version is also proposed. This supervision is achieved with kinematic data.

    They experiment on the commonly used JIGSAWS dataset and on HeiChole, which is in vivo and therefore more realistic. The proposed method is able to achieve an improvement over state-of-the-art models.

    The organization and flow of the paper are good, as is the technical writing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Inverse kinematics are used as supervision in the Semantic Grouping Module (SGM); however, inverse kinematics are not sensitive enough to guide this process, and it can in fact be observed in Fig. 4 that the tool positions are neither accurate nor precise. My understanding is that the authors use inverse kinematics to provide roughly accurate supervision, and it seems to improve performance.

    No supervision is required to discover semantic features, which are expected to belong to different surgical elements such as tools, the tissue, and the background. While this is interesting, it is achieved by clustering, and the number of clusters is chosen somewhat arbitrarily in the belief that they will correspond to the surgical elements. In Fig. 4, for example, we see that there is only a loose correspondence.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that they will release the source code upon publication. They experiment on publicly available datasets: one is the commonly used JIGSAWS with predefined experimental setups, and the other is HeiChole, for which the authors include some details about their experimental setup.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The use of inverse kinematics for supervision, given that they do not provide sufficiently accurate or sensitive supervision, deserves discussion.

    It is interesting that the proposed model uses clustering to aggregate semantic features and therefore does not need supervision. However, a note on how K is chosen somewhat arbitrarily (in the belief that these features will relate to tools, tissue, and the background) should be added and discussed in light of the resulting clusters, which only loosely relate to these elements (Fig. 4).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors’ motivation to address the shortcoming of state-of-the-art models by discovering and aggregating different semantic parts of the surgical setting across spatiotemporal dimensions is an interesting approach.

    As the authors cluster spatial features, no supervision is required to discover these semantic features which are expected to belong to different surgical elements such as tools, the tissue, and the background.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel framework for assessing skill from surgical videos by explicitly separating different semantic parts in the video and aggregating them across spatiotemporal dimensions. The approach successfully separates semantic information in videos and provides explanatory visualization results. The framework has been tested on the JIGSAWS and HeiChole datasets, showing competitive performance. The approach is novel and interesting, and the evaluation experiments are thorough. The topic is of interest to the community. The main criticisms of the work concern the evaluation metrics and measures of performance, the discussion/analysis of the results, and missing details regarding the experimental setup. The following points should be addressed in the rebuttal:

    • Justification regarding the evaluation metrics and performance measures used in the results (including lack of measures of variance)
    • Details and justification regarding experimental setup (e.g. number of clusters K) and how they are chosen.
    • Better discussion of the results, including the effects of the SGM, and clarifications around the use of inverse kinematics in the tool position supervision (and its limitations).
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We appreciate the reviewers and meta-reviewers for their constructive feedback. We are pleased that they find our work novel (Meta-reviewer and R3) and attractive (R2). We address the concerns summarized in the meta-review as follows.

(1) Evaluation metrics and performance measures: R1’s primary concern is that measures of variance are missing. We agree that this brings a valuable statistical perspective and that reporting variance would strengthen the evaluation. R1 also acknowledges that we omitted measures of variance because we followed the reporting methodology of previous publications, which did not report variance, “likely because of insufficient input from collaborators with statistical expertise”. Although the variance of network predictions can be computed straightforwardly, the rebuttal policy prevents us from providing or promising new results. However, since we report cross-validation results averaged over multiple runs, we believe the performance improvement does not stem from sampling variance. Another concern of R1 is the use of ranking correlation. We chose ranking correlation as the metric for comparison with previous works because it has been the main metric since the JIGSAWS dataset was published, and many previous works report results only on this metric. In our ablation study we additionally report MAE, and we find that the results on MAE and ranking correlation are highly consistent. We therefore believe ranking correlation is not as irrelevant a measure as R1 states.

(2) Experimental setup: We thank R2 and R3 for the instructive suggestions regarding the clarification of the cluster number K. We initialize K according to the scene complexity and fine-tune it based on the experimental results if needed. Since the surgical scenes in our experiments are explicitly composed of tools, tissue, and background, the model with K=3 can generally assign features to these three kinds of semantics, which is consistent with our expectations. K does need to be adjusted when scenes become more complex as the surgery type changes. As pointed out by R3, when we aggregate features into the specified number of groups without supervision, the resulting groups may not perfectly correspond to the expected semantics. This is observed more often in the HeiChole dataset due to the variety in tissue appearance and tool gestures (R1). We believe this can be corrected to some extent by incorporating supervision information, as discussed in the paper. As for temporal context modeling (R1 & R2), we also experimented with a transformer structure, as reported in Tab. 3; it has two transformer layers, each consisting of LayerNorm + Multi-Head Attention + MLP.
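
For concreteness, one plausible instantiation of this description is the pre-norm layer sketched below; the embedding size, head count, and other hyperparameters are illustrative assumptions rather than the exact values used.

    import torch
    import torch.nn as nn

    class PreNormTransformerLayer(nn.Module):
        """One LayerNorm + Multi-Head Attention + MLP layer with residual
        connections. Dimensions are assumptions for illustration; only the
        layer composition and the depth (two) are specified above."""

        def __init__(self, dim=512, heads=8, mlp_ratio=4, dropout=0.1):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                              batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                nn.Dropout(dropout), nn.Linear(mlp_ratio * dim, dim),
            )

        def forward(self, x):                  # x: (B, T, dim) per-frame features
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.norm2(x))
            return x

    # two such layers as the temporal context module
    temporal = nn.Sequential(PreNormTransformerLayer(), PreNormTransformerLayer())
    out = temporal(torch.randn(2, 64, 512))    # e.g. 64 time steps per video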

(3) Discussion of the results: Regarding the effects of SGM (R2), R2’s understanding is correct. We reach the conclusion of improved concentration mainly through a qualitative comparison of the explanations given by Grad-CAM, which visualizes the regions decisive for the model’s prediction as heatmaps. As shown in Fig. 3, compared with the explanations for the model without SGM, the results with SGM highlight the tool regions more and attend less to the background regions. Regarding the use of inverse kinematics as auxiliary supervision (R3), we expect it to improve performance by explicitly assigning the semantic cue of tools to a certain feature cluster and correcting imperfectly clustered results. Although the inverse-kinematic supervision is not fully precise in our experiments, we consider it helpful for the network to separate tool features from other features, as Fig. 4 indicates.
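
As background, Grad-CAM itself can be reproduced with a pair of forward/backward hooks. The generic sketch below uses a stand-in ResNet backbone and a random input; it illustrates the technique only and is not the exact visualization code used in the paper.

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet18   # needs torchvision >= 0.13

    def grad_cam(model, target_layer, image, class_idx=None):
        """Generic Grad-CAM: weight the target layer's activations by the
        spatial average of their gradients w.r.t. the chosen output score."""
        acts, grads = {}, {}
        h1 = target_layer.register_forward_hook(
            lambda m, i, o: acts.update(a=o))
        h2 = target_layer.register_full_backward_hook(
            lambda m, gi, go: grads.update(g=go[0]))
        try:
            scores = model(image)                          # (1, num_outputs)
            idx = int(scores.argmax(1)) if class_idx is None else class_idx
            model.zero_grad()
            scores[0, idx].backward()
            w = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP of gradients
            cam = F.relu((w * acts["a"]).sum(dim=1))       # (1, H, W) heatmap
            cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        finally:
            h1.remove(); h2.remove()
        return cam

    # example on a stand-in backbone and a random image
    model = resnet18(weights=None).eval()
    cam = grad_cam(model, model.layer4, torch.randn(1, 3, 224, 224))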

We will add clarification and discussion of the above points and incorporate the reviewers’ suggestions in the subsequent version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The topic is of interest to the CAI community, the paper is well-written, the approach is novel and thoroughly validated. The main concerns of the reviewers regarding the evaluation metrics, experimental setup, and discussion of results have been addressed in the rebuttal, and should be incorporated in the final submission.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers agree that the paper introduces an interesting method for surgical skill assessment. Some concerns were raised during the first stage, but the authors successfully addressed them during the rebuttal. I recommend acceptance of this paper based on the consensus of the reviews.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1


