
Authors

Shaowu Peng, Pengcheng Zhao, Yongyu Ye, Junying Chen, Yunbing Chang, Xiaoqing Zheng

Abstract

Endoscopic surgery is currently an important treatment method in the field of spinal surgery, and avoiding damage to the spinal nerves through video guidance is a key challenge. This paper presents the first real-time segmentation method for spinal nerves in endoscopic surgery, which provides crucial navigational information for surgeons. A finely annotated segmentation dataset of approximately 10,000 consecutive frames recorded during surgery is constructed for the first time for this field, addressing the problem of semantic segmentation. Based on this dataset, we propose FUnet (Frame-Unet), which achieves state-of-the-art performance by utilizing inter-frame information and self-attention mechanisms. We also conduct extended experiments on a similar polyp endoscopy video dataset and show that the model has good generalization ability with advantageous performance. The dataset and code of this work are presented at: https://github.com/zzzzzzpc/FUnet .

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_57

SharedIt: https://rdcu.be/dnwP1

Link to the code repository

https://github.com/zzzzzzpc/FUnet

Link to the dataset(s)

https://github.com/zzzzzzpc/FUnet


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a real-time spinal nerve segmentation model, FUnet, for aiding endoscopic surgery, along with a finely annotated segmentation dataset with ~10k frames including ~5k annotated spinal nerve segmentation masks. The authors report extensive experimental results and ablation studies, compared to existing segmentation models, to show the applicability of their proposed approach on existing as well as the proposed endoscopy video dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper writing and organization is nice and clear, overall the paper is easy to follow.
    2. The authors performed extensive experiments to prove the effectiveness and generalizability of their proposed approach, including quantitative comparisons with several popular segmentation pipelines (Unet [5], Unet++ [15], TransUnet (ViT-Base) [16], SwinUnet (ViT-Base) [17] and PNSNet [11]), ablation studies of the IFA and CSA modules in FUnet on the proposed endoscopy video dataset, and quantitative evaluations of FUnet against competing segmentation pipelines on an existing publicly accessible endoscopy video dataset. Overall the experiments and evaluations presented in the paper are extensive and convincing.
    3. According to the authors’ evaluation in the paper, the proposed FUnet can achieve real-time (75 FPS) inference, making it a very practical pipeline for assisting endoscopic surgeries.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Inference speed and practicality. The authors claim that the proposed FUnet can achieve 75 FPS inference with a single TITAN RTX 24GB GPU, which is very interesting. I wonder how the other segmentation models compared in Table 1 perform in this respect. Do they use fewer network parameters than the proposed FUnet? It would be interesting to see a quantitative comparison of the inference speed of the different networks presented in Table 1. Besides, as stated in Section 2.1, the video image resolution is 1080*1080; what is the input size to the proposed FUnet? Are the images downsampled to ensure real-time inference?
    2. While IFA seems to be an important module in the proposed FUnet, certain ablation studies would be interesting to add: how does “T” affect the final prediction results?
    3. It would be better if the authors could clarify whether the dataset proposed in the paper will be made publicly accessible.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Though the authors did not provide code for reproducing the experimental results presented in the paper, there is a clear description of the implementation details in the main paper. I believe the experimental results of this paper are reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall I feel the paper is solid and clearly presented, along with extensive experiments to prove the effectiveness of the proposed approach. It would be interesting to see a more general application of the proposed approach in other data modalities and tasks, not only limited to video endoscopic spinal nerve segmentation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As discussed in Section 5 and 6, while it would be interesting to see more detailed discussions/studies from certain details, e.g. more extensive study of inference speed, overall the proposed paper is clear and solid. The authors provided extensive experiments and ablation studies to prove their points.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents the extension of a Unet by Inter Frame Attention, Attention Down, and Channel Self-Attention modules. The extended network architecture is tested on an endoscopic spine data set to segment spinal nerves and on another data set to segment polyps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work shows a modification of a Unet that improves its detection / segmentation capabilities a bit further. The work is tested on two data sets that demonstrate the generalizability of the approach. Nice evaluation performed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Concerning the data sets and the examples shown, it might be useful to provide some explanation of what is shown. Not all readers will be able to find spinal nerves or polyps without an explanation and description of the images. It might be advantageous to state explicitly which data set shows what. At the end of the evaluation section, the data sets “Our-VT” and “CVC-612-Valid” appear without specification.

    In the figure captions the word “supplemnetary” needs to be exchanged to “supplementary”

    Section 2.2 only mentions the IFA module while the CSA module should be included here too.

    Section 2.3: it is not evident why the splitting of channels “allows subsequent convolutional operations to share ….” Please either give details or provide a reference.

    Channel Pyramid: Why is it difficult to capture information across frames and what does this sentence mean? What is inter-frame information? In the same sentence a better word for “losses” might be chosen. This whole section needs extensive referencing or explanation, being the core element of the paper. ADB section: the text does not seem to be in line with the information shown in Fig. 3. Please fix. What is the sequence of lightRFB and the convolutions?

    Section 2.4: Why does reducing the length of an image vector reduce the amount of SHARING INFORMATION BETWEEN PATCHES? The explanation given via convolution is not clear and convincing. The last paragraph of this section can be eliminated and simply replaced by eq. (2).

    Section 3.1 should contain information on which data set shows what. The “Our-VT” data set should be explained earlier in the data set section; similar to CVC-612-valid.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    cannot be judged

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    see above

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Nice work done that suffers from weaknesses in presentation that can be fixed.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors’ answers have cleared my concerns and the paper is now good.



Review #3

  • Please describe the contribution of the paper
    1. The first semantic segmentation dataset of spinal nerves in endoscopic surgery is established.
    2. The FUnet segmentation network with IFA and CSA modules is designed, which achieves state-of-the-art performance on the new spinal nerve dataset and competitive performance on the polyp dataset.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The work is well motivated. Avoiding damage to spinal nerves in endoscopic surgeries is crucial and visual guidance/assistance is a good solution.
    2. The endoscopic image dataset of spinal surgery is built, which has consecutive frames and fine labels.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The details of the new dataset are far from sufficient. Where were the surgical videos obtained: an operating room or a laboratory? Were the surgeries performed by surgeons on patients? Is the video length only ~5 min, assuming the fps is 30? How is the dataset split? Are the training and testing sets selected from the same video?
    2. The comparison experiment setup is not reasonable enough. First, the most recent transformer-based TransUnet and SwinUnet were proposed for medical images rather than surgical images. Second, more multi-frame based segmentation models like PNSNet should be compared. Moreover, it seems that the maxDice score of PNSNet on the CVC-612-VT dataset reported in Table 3 is different from the score reported in Ref 11?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The data and code are not public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Although the dataset has as many as 10k images, the lack of details and the unclear data splitting method make the value of the new dataset kind of vague.
    2. More recent multi-frame based segmentation model should be compared and discussed.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The value of the dataset is vague due to the lack of details and the unclear sub-dataset distribution. The model’s advantage is not convincing due to the minor improvement compared to PNSNet and the insufficient comparison experiments.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors add more details about the new dataset. Although I expect more related comparison methods from the endoscopic vision area, the current comparison methods are proposed in recent years and somehow can reflect the superiority of the proposed model.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents the first semantic segmentation dataset of spinal nerves in endoscopic surgery and proposes the FUNet model for segmentation, which is shown to outperform the SOTA on the new dataset and to achieve competitive performance on the polyp dataset. The paper is well-motivated, as avoiding damage to spinal nerves in endoscopic surgeries is crucial, and visual guidance/assistance can be a promising solution.

    Reviewers have identified some discrepancies in the paper, which must be addressed during the rebuttal. In particular, the following comments should be addressed: (a) the lack of details regarding the dataset, including data source, surgical setting, and how the dataset was split (Reviewer 3); (b) the need for a more reasonable comparison experiment setup (Reviewer 3); (c) the code and data are not public. Please comment on the plan for making them publicly available to the research community.




Author Feedback

We have considered and summarized the opinions of all the reviewers. Regarding the various issues raised, our responses are as follows:

  1. Regarding more detailed information about the dataset. The data used in our study were collected during surgical procedures performed by spine surgeons themselves. The data formatting, medical equipment, and resolution are described in Section 2.1. As described in Section 3.1, the input resolution for the model is set to 256*448. Our dataset is drawn from extensive surgical footage, covering several hours of surgical procedures across multiple surgeries. Surgeons carefully reviewed and selected high-value, representative, and common scene segments. The dataset comprises dozens of continuous frame segments, each consisting of 200 consecutive frames. The data was divided into training, validation, and testing sets with proportions of 65%, 17.5%, and 17.5% respectively. To ensure a representative and consistent distribution across the sets, the videos were divided into blocks of 5 to 10 segments. Each block was selected to have similar video characteristics, such as the presence of nerves or the use of surgical instruments. The training, validation, and testing sets were proportionally selected from each block.
  2. The release plan for the code and dataset. If the paper is accepted, we will release the implemented code and related datasets. We will also provide a configuration document for training and pre-trained weights.
  3. It seems that the maxDice score of PNSNet reported in Table 3 is different from the score reported in Ref 11? In Ref 11, CVC-612-V consists of 112 images and CVC-612-T consists of 122 images. We present the combined CVC-612-VT result as the size-weighted average: (0.873 × 112 + 0.860 × 122) / (112 + 122) ≈ 0.866.
  4. The comparison experiment setup is not reasonable enough. Although TransUnet and SwinUnet are not designed for this task, they have shown excellent performance in general medical segmentation. Comparing FUNet with them highlights its advantages in endoscopic video segmentation. Additionally, there is limited research on multi-frame networks for endoscopic spine segmentation. PNSNet is a representative network with outstanding performance in similar tasks and serves as a suitable point of comparison to showcase the advantages of FUNet. Besides, to demonstrate generalization capability, we also compare on a polyp dataset.
  5. What is the effect of T? And other questions in Review 2. Reducing frame rate T can improve speed but decreases accuracy. Conversely, increasing T can enhance accuracy, but there may be diminishing returns when T becomes large. Regarding the question about “and the length corresponding to each patch is the length of each pixel channel (32×32),” the expression “32×32” refers to the size of the feature map, not the dimensionality. The pixel channel dimension is denoted as C. Each patch is obtained by applying convolution and then flattening the resulting (1,1,C) vector. Additionally, pixels in the same position participate in the calculation of different patches, allowing information to be shared among different patches. The LightRFB module first down-samples the image at the channel level without changing the scale information. At the same time, the convolution operations output the weight information at that scale.
  6. What is the effect of fps on endoscopic spine segmentation? Frame rate is not the primary concern: real-time video input operates at 30~60fps, while FUNet achieves 75fps, which fully meets the requirements. Instead, we focused on the drawbacks of higher frame rates. For instance, although Unet can achieve 175fps, its accuracy is comparatively lower, which could lead to secondary damage. We made every effort to ensure fairness by keeping the architecture, input resolutions, and hardware consistent. Furthermore, due to the advantages of FUNet’s network architecture, other networks such as SwinUnet, even with more parameters, fall short in both accuracy and speed (55fps) when compared to FUNet.
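
The combined CVC-612-VT figure in item 3 can be checked with a short sketch. It assumes, as the rebuttal states, that the combined score is the average of the two per-subset maxDice scores weighted by subset size; the function name `combined_score` is hypothetical and is not taken from the FUnet repository.

```python
def combined_score(scores_and_sizes):
    """Size-weighted mean of per-subset scores.

    scores_and_sizes: iterable of (score, number_of_images) pairs.
    """
    pairs = list(scores_and_sizes)
    total_images = sum(n for _, n in pairs)
    return sum(score * n for score, n in pairs) / total_images


# Values quoted in the rebuttal: CVC-612-V (112 images), CVC-612-T (122 images).
cvc_612_vt = combined_score([(0.873, 112), (0.860, 122)])
print(round(cvc_612_vt, 3))  # 0.866
```

Weighting by image count rather than taking the plain mean of 0.873 and 0.860 matters only slightly here because the two subsets are nearly the same size.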




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors managed to address all major comments. An accept is recommended.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a real-time spinal nerve segmentation model for aiding endoscopic surgery. The overall framework is promising and the experiments also demonstrated the effectiveness of this work. Most of the issues are also addressed by the rebuttal. Therefore, my final rating is accept.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified most of the reviewers’ major concerns, specifically by giving additional details regarding the dataset, explaining the frame rate, and promising to release the dataset to the research community. I found the proposed method interesting, and the results demonstrate its effectiveness; I thus suggest acceptance.


