
Authors

Xinpeng Ding, Xiaowei Xu, Xiaomeng Li

Abstract

Thoracoscopy-assisted mitral valve replacement (MVR), an important treatment for patients with mitral regurgitation, requires a high level of surgical skill to prevent avoidable complications and improve patient outcomes. Hence, surgical skill assessment (SKA) for MVR is essential for the training and certification of novice surgeons. Current automatic SKA approaches suffer from several inherent limitations, e.g., the lack of public thoracoscopy-assisted surgery datasets, the neglect of inter-video relations, and the restriction to SKA of a single short surgical action. In this paper, we collect a new clinical dataset for MVR, which is, to the best of our knowledge, the first thoracoscopy-assisted long-form surgery dataset. Unlike a short clip that contains only a single action, videos in our dataset record the whole MVR procedure, consisting of multiple complex skill-related surgical events. To tackle the challenges posed by MVR, we propose a novel baseline named Surgical Event Driven Skill assessment (SEDSkill), a long-form and surgical-event-driven method. Compared to current methods that capture only the intra-video semantics of the global video, our proposed SEDSkill contains a local-global relative module that learns inter-video relations for both global long-form and local surgical-event-correlated semantics. Specifically, an event-aware module is designed to automatically localize skill-related events in long-form videos, thus extracting the local semantics. Furthermore, we introduce a relative regression block to learn the imperceptible discrepancies needed to accurately assess surgical skills. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art approaches in the MVR scenario.
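The pairwise "relative regression" idea described in the abstract — predicting the skill-score difference between two videos rather than regressing each score in isolation — can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the pooling, the linear head, and all dimensions and values are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of pairwise
# relative regression: predict the skill-score difference between two
# videos from their pooled features. All names/dimensions are hypothetical.

def pool_video(frame_features):
    """Average-pool a list of per-frame feature vectors into one
    global vector for the whole video."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def relative_score(feat_a, feat_b, weights, bias):
    """Toy linear head on the concatenated pair of pooled features;
    the output is interpreted as score(a) - score(b)."""
    pair = feat_a + feat_b  # concatenation of two Python lists
    return sum(p * w for p, w in zip(pair, weights)) + bias

# Two hypothetical videos with 3 frames each and 2-dim features.
video_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # pools to [3.0, 4.0]
video_b = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]  # pools to [2.0, 2.0]
weights = [1.0, -1.0, 0.5, 0.5]                 # untrained, illustration only
diff = relative_score(pool_video(video_a), pool_video(video_b), weights, 0.0)
print(diff)  # 3*1 + 4*(-1) + 2*0.5 + 2*0.5 = 1.0
```

In the paper's setting the pooled vectors would come from a learned backbone and the head would be trained on score differences of video pairs; the sketch only shows the pairing structure.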

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_4

SharedIt: https://rdcu.be/dnwOE

Link to the code repository

https://github.com/xmed-lab/SEDSkill

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a skill assessment method that incorporates a local-global relative framework with attention mechanism, based on a new thoracoscopy-assisted surgical video dataset. The proposed method utilizes a basic regression module to predict the skill score for a video, and a local-global relative module to predict relative scores for video pairs. Good results are achieved on the proposed dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea of aggregating the local (i.e., event-level) and global (long-form video) semantics is inspiring. To focus on the skill-related parts and remove the less informative ones, localizing the key temporal events is important. The proposed local-global relative module learns between action pairs and captures inter-video information. 2) Recognizing skill-related surgical events and utilizing this information is quite reasonable for skill assessment. The skill score of a surgical video is closely tied to the occurrence of inappropriate operations. Combining the recognition of imperceptible discrepancies with surgical skill assessment is a practical way to improve the results. Experimental and qualitative results have shown the effectiveness of this idea. 3) A new dataset for surgical skill assessment is collected, which is valuable for the field. Temporal annotations for key skill events are provided. 4) Good experimental results are achieved on the newly proposed dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Though effective, the ideas of combining temporal information in skill assessment and of learning local-global information from video pairs are not novel. The overall procedure shares similarities with action quality assessment methods; please refer to “FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment”. 2) The motivation of encouraging the model to focus on the skill-related parts and remove the less informative ones is not fully verified. From the perspective of spatial information, the visual features are similar within a surgical video. Therefore, capturing key components and variance in spatial space is also important, which is not discussed in this paper. 3) (Minor) The number and duration of imperceptible discrepancies are manually set, which may affect the reproducibility of the method on other similar tasks. 4) Some expressions may cause ambiguity. For example: - The meaning of “inter-video relation” could be clarified. - “Fig. 2(b)”, mentioned in Paragraph 1, Section 2.2, does not exist. - “4 values”, mentioned in Paragraph 2, Section 2.2, actually indicates the number of event categories, which is clarified in Paragraph 1, Section 3. I would suggest replacing “4” with “C” and clarifying “C=4” in Section 3.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be made available, but it is unclear whether the dataset will be.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) The authors are suggested to add comparisons with the method in “FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment” and discuss the difference between the proposed method and FineDiving. 2) The presentation can be improved. Also, the motivation described in Section 2.2 can be reorganized.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the weak novelty, the good experimental results, and the new dataset, the reviewer recommends a weak accept.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal, I stay with weak accept, weighing the contribution of the dataset against the limited novelty.



Review #2

  • Please describe the contribution of the paper

    The contribution of the paper comes from two parts:

    • A new dataset for surgical video analysis. The dataset should be valuable for research on both surgical skill assessment and surgical workflow understanding, considering its diverse annotations. In particular, the videos are captured in real surgical scenes, so the dataset can be a good supplement to the exhaustively studied JIGSAWS dataset.
    • A new framework is proposed for predicting surgical skill scores from video frames. In particular, the framework focuses on resolving the challenge of long-term temporal relationship modeling posed by the 30-min+ videos in the new dataset. It incorporates event-aware local features, disparity learning across video pairs based on an attention module, and so-called ‘relative regression’. The framework demonstrates competitive performance on the new dataset.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper demonstrates a new in-vivo surgical video dataset with usable annotations.
    • The paper proposes a new framework for skill score regression from video frames. The framework shows competitive performance on the proposed new dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper is not easy to follow because of many confusing module names and terms (e.g., relative regression, local-global fusion, intra-video relations, inter-video semantics, and ‘ablate this module’). These terms do not seem to be standard usage, yet they are not clearly defined or explained. Also, it is not clear whether the local feature F^S_i and the global feature F^G_i are two groups of feature vectors with different time stamps or two vectors for the whole video.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The source code is not provided. Also, it is not clear whether the dataset will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would recommend that the authors provide, in the rebuttal, clear explanations of and modification plans for terms like relative regression, local-global fusion, intra-video relations, inter-video semantics, and ‘ablate this module’.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I lean toward accepting the paper considering the new in-vivo dataset. The writing is a weakness, but if the authors can demonstrate a clear direction for revision, it will not be the main reason for rejection.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a long-form thoracoscopy-assisted MVR video dataset as well as a novel events-driven method for the surgical skill assessment of MVR. The method considers both global and local event semantics of each video, based on which inter-video relations are learnt with a relative regression module.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The studied application of surgical skill assessment (MVR) is novel, and the collected videos represent real surgeries, which tend to have long durations and important events.
    2. The presented idea of learning inter-video relations instead of directly regressing over each video is interesting. Especially for long videos with underlying standard procedures, the skill-related differences can be subtle; a regular network may easily overfit and fail to generalize to unseen videos.
    3. Ablation study is further included in the experiment, which helps understand the effect of each module in the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. As far as I understand, the event detector is critical to the method, as it directly affects the local semantics, which then propagate onward. However, this detector is not clearly elaborated or validated in the experiments. It is stated that the pre-trained detector is frozen during the main training, which makes me concerned about the robustness of the method to detection errors.
    2. The global video feature seems to be spatial only instead of spatio-temporal. The choice of backbone remains to be clarified and justified. Besides, the obtained local event-level feature is sparse because the three events account for only a small portion of the entire video. A more compact representation could be valuable for further fusion.
    3. The ablation study shows that the concatenation operation for local-global fusion achieves the best performance. However, the local feature is derived from the global feature, leading to information overlap. The relationship between these two features is hierarchical rather than parallel. Therefore, direct concatenation is not convincing to me.
    4. In the experiments, existing methods for action quality assessment are compared, while comparisons to surgical skill assessment methods are insufficient.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim they will release the code upon paper acceptance. The dataset involved in the paper was collected by the authors and is not public. Since the proposed method is specifically designed for the MVR application, it would be valuable to make the dataset available upon request to facilitate future research in this application. Additionally, more descriptions of the dataset and the experimental setup could be included in the supplementary material.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The description of the basic regression module can be brief, and some justification of the event detector as well as the local-global fusion can be added.
    2. In Section 3.1, the work of [15] is categorized as surgical skill assessment, but it actually belongs to action quality assessment (scoring Olympic events). More recent methods in surgical skill assessment need to be compared.
    3. From Fig. 3 it can be observed that for video frames without an actual event, the confidence values are still non-negligible, especially for green and orange. Perhaps such values need to be filtered before deriving the local event-level feature.
    4. For the relative regression block: attention is widely used in vision transformers, so employing a transformer architecture for the regression would be possible. Currently there seems to be no positional encoding, and the features are also spatial, without much temporal information.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall application and the original idea of the paper are meaningful and interesting to me. However, some of the technical details of the designed modules still need improvement. The novelty is not clearly reflected in the methodology, and the experiments are also insufficient for clinical impact.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Most of the concerns have been properly addressed in the rebuttal. The new dataset should be made public or at least available upon request so that the entire work is valuable to the research community of surgical skill assessment.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a long-form surgical-event-driven method for surgical skill assessment (SEDSkill) in thoracoscopy-assisted MVR, incorporating video-wise and event-wise relative learning to capture both local and global inter-video relations. The reviewers highlight the introduction of a new clinical dataset for MVR with usable annotations, the proposed local-global relative module, good experimental results, and promising performance compared to SOTA as the strengths of the paper.

    The main criticisms of the work are regarding clarification on the novelty of the proposed approach, missing technical details of the implementation, limited validation experiments/comparison to recent surgical skill assessment methods, and the writing not being easy to follow.

    The following points should be addressed in the rebuttal:

    • Clarification regarding the novelty of the proposed approach (including with respect to action quality assessment methods such as FineDiving)
    • Further clarification and justification regarding technical details in the methodology such as the event detector, local-global fusion (as well as clarification on the local and global feature vectors), and justification regarding the choice of backbone
    • Justification for limited validation experiments/comparison to more recent surgical skill assessment methods
    • Clarifications regarding certain module names and terminology, and how these would be addressed
    • Justification regarding the motivation of the paper to focus on skill-related parts not being fully verified by the approach (and not capturing key components and variance in spatial space)




Author Feedback

We thank the reviewers for their valuable feedback. Overall, the reviewers (R1, R2, R3) consider that our paper presents a valuable dataset for surgical skill assessment and that our proposed method achieves strong performance on the newly collected dataset. Besides, R1 & R3 agree that our idea is inspiring and interesting. The major concern of R1 is the difference between our work and FineDiving; the other comments concern explanations of details and experiments. Below, we clarify the important points summarized by the meta-reviewer and resolve possible misunderstandings.

(1) The novelty of our method, with respect to the action quality assessment method FineDiving (R1, AC): The differences lie in two primary aspects: dataset and methodology. Our videos are significantly longer (30 min) with more noisy frames, compared to FineDiving’s short (4.2 s) videos that contain all key frames. Methodologically, we mine key frames using a key event detector, unlike FineDiving, which segments all frames. We also consider both local and global information, in contrast to FineDiving’s local-only design. Experiments also demonstrate the superiority of our method.

(2) Technical details of the methodology (R2, R3, AC): Due to the page limit, we only present the main idea and key operation of each module. The event detector is a transformer-like network that maps the video features to classification logits and regressed start/end times. As described in our manuscript, local-global fusion is introduced to aggregate local and global features by concatenation, addition, multiplication, or attention. Details of the event detector and local-global fusion will be added to the supplementary material. Global features are passed through an average-pooling layer, hence the local and global features are each a single vector.

(3) Comparison to recent methods (R3, AC): Here, we select two more recent surgical skill assessment methods for comparison:

| MTL-VF (Wang et al., MICCAI 2020) | MultiPath (Liu et al., CVPR 2021) | Ours |
| 2.31 | 2.22 | 1.48 |

(4) Module names and terminology (R2, R3, AC): We will define and explain module names and terminology more clearly. ‘Relative regression’ means predicting the difference between the surgery scores of two videos; we will change it to ‘difference prediction’. ‘Local-global fusion’ means aggregating the local (event-level) and global (long-form video) information, as illustrated in our paper (see Sec. 2.2). ‘Intra-video semantics’ means the temporal information within a single video (e.g., relations between frames), while ‘inter-video semantics’ means the relations between two videos (e.g., their differences). ‘Ablate this module’ means that we conduct an ablation study to analyze the effect of this module. We will add more details in the final manuscript.

(5) Concerns about global features (R3): The global features do not contain only spatial information; we use a temporal convolution layer to capture temporal information for the global features.

(6) No positional encoding in the relative regression block (R3): This block is used to learn the relations between high-level features (vectors) of two videos rather than temporal features, hence there is no need for positional encoding.

(7) Local features are derived from the global features (R3): Both local and global features are computed from the spatial features; hence the local features are not derived from the global features.

(8) The motivation, i.e., focusing on skill-related parts without capturing variance in spatial space (R1, AC): The motivation for focusing on skill-related parts is based on discussions with clinicians, whose clinical evaluation of MVR skill is largely based on the three events mentioned in this paper. Furthermore, besides the skill-related parts, we also capture the remaining information of the whole video (variance in spatial space) via the temporal feature extractor. The ablation study (Table 2) shows that both the skill-related parts (local) and the other semantics (global) contribute to the skill assessment of MVR.
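The fusion variants named in the rebuttal (concatenation, addition, multiplication; attention is omitted here) can be sketched as follows. This is a minimal illustration under the rebuttal's description that local and global features are each a single vector; the function name, feature values, and dimensions are hypothetical, not taken from the authors' code.

```python
# Minimal sketch of the local-global fusion variants named in the
# rebuttal (concat / add / mul); attention is omitted. All names and
# values are hypothetical, for illustration only.

def fuse(local_feat, global_feat, mode="concat"):
    """Fuse a local (event-level) and a global (whole-video) feature
    vector, both plain Python lists of equal length."""
    if mode == "concat":
        return local_feat + global_feat
    if mode == "add":
        return [l + g for l, g in zip(local_feat, global_feat)]
    if mode == "mul":
        return [l * g for l, g in zip(local_feat, global_feat)]
    raise ValueError(f"unknown fusion mode: {mode}")

local_f = [1.0, 1.0, 1.0, 1.0]
global_f = [0.0, 1.0, 2.0, 3.0]
print(len(fuse(local_f, global_f, "concat")))  # 8
print(fuse(local_f, global_f, "add"))          # [1.0, 2.0, 3.0, 4.0]
```

Note that concatenation doubles the feature dimension while the elementwise variants preserve it, which is one practical reason the ablation in the paper compares them separately.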




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main comments from the reviewers regarding the technical details of the methodology, additional clarification regarding the novelty of the approach, and the motivation to focus on skill-related parts have been addressed by the rebuttal.

    Feedback from reviewers regarding improvements to the writing and better definitions of module names and terminology should be incorporated into the final revision.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed the major concerns. Despite some weaknesses, the contribution of a new dataset outweighs them, and hence an accept is recommended. It is suggested that the dataset be made publicly available with this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I’ll divide my meta-review of the paper into a few distinct parts as the authors’ submission notes that they are submitting this with technical and application aspects as drivers of consideration for advances:

    1) Application framework: While the authors report that skills were assessed by surgeons, they do not report what assessment rubric was utilized, how it was scored (other than on a 15-point scale), or how it was validated. While they imply that the duration of an operation correlates with skill (this is partially true in the clinical literature), they do not mention accounting for clinical variables that can confound the use of time as a surrogate for skill. Therefore, from an application perspective, there is little to no clinical utility in their skill scores.

    2) Technical perspective: The use of local and global features is interesting, especially over longer-form video of 30+ minutes.

    3) Novelty: Reviewers cite the novelty of the dataset as a strength of the paper. While I agree that having a new dataset on thoracoscopic MVR is of immense value, the authors have not committed to releasing the data. I am also torn on the value of the annotations for the dataset, given my above comment regarding the lack of validity of the skill assessment. I would worry that the technical community would latch onto such annotations as truly reflective of skill, diverting attention toward further exploring such annotations and away from more valid markers of surgical skill.

    All in all, however, the authors provide a well-written rebuttal with a commitment to address the major concerns raised by the reviewers and Meta-reviewer 1. Thus, while I have my reservations about the work, I am inclined to lean, just barely, toward an accept decision.


