
Authors

Jann-Ole Henningson, Marion Semmler, Michael Döllinger, Marc Stamminger

Abstract

In recent years, phoniatric diagnostics has seen a surge of interest in structured light-based high-speed video endoscopy, as it enables the observation of oscillating human vocal folds in the vertical direction. However, structured light laryngoscopy suffers from practical problems: specular reflections interfere with the projected pattern, mucosal tissue dilates the pattern, and the algorithms need to deal with the huge amounts of data generated by a high-speed video camera. To address these issues, we propose a neural approach for joint semantic segmentation and keypoint detection in structured light high-speed video endoscopy that improves the robustness, accuracy, and performance of current human vocal fold reconstruction pipelines. Major contributions are the reformulation of one channel of a semantic segmentation approach as a single-channel heatmap regression problem, and the prediction of sub-pixel accurate 2D point locations through weighted least squares in a fully differentiable manner with negligible computational cost. Lastly, we expand the publicly available Human Laser Endoscopic dataset to also include segmentations of the human vocal folds themselves. The source code and dataset are available at: https://github.com/Henningson/SSSLsquared
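For illustration, the following is a minimal PyTorch sketch of the coarse peak extraction step implied by the abstract (finding candidate laser dots in the single heatmap channel before sub-pixel refinement); the function name, threshold, and 3x3 neighbourhood are assumptions made for this example, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap: torch.Tensor, thresh: float = 0.5):
    """Illustrative coarse peak extraction from a single-channel heatmap.

    heatmap: (B, 1, H, W) tensor, e.g. the laser-dot channel of the network
    output.  A pixel counts as a peak if it equals the maximum of its 3x3
    neighbourhood and exceeds `thresh`.  Returns per-image (y, x) indices,
    which a sub-pixel refinement step (e.g. a weighted Gaussian fit) can
    then sharpen.
    """
    local_max = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = (heatmap == local_max) & (heatmap > thresh)
    return [p.nonzero()[:, -2:] for p in peaks]  # list of (N_i, 2) index tensors
```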

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_4

SharedIt: https://rdcu.be/dnwJm

Link to the code repository

https://github.com/Henningson/SSSLsquared

Link to the dataset(s)

https://github.com/Henningson/HLEDataset


Reviews

Review #2

  • Please describe the contribution of the paper

The paper presents a neural approach to improve the robustness, accuracy, and performance of human vocal fold reconstruction pipelines in structured light high-speed video endoscopy. The proposed method detects laser dots reliably and with high precision, running at up to 926 FPS on images of size 512 × 256 using a 2.5D U-Net architecture. The approach reformulates one channel of a semantic segmentation method as a single-channel heatmap regression problem and predicts sub-pixel accurate 2D point locations through weighted least squares in a fully differentiable manner with negligible computational cost. Additionally, the publicly available Human Laser Endoscopic dataset is expanded to include segmentations of the human vocal folds themselves.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This study proposes a multi-task network for robust point localization and semantic segmentation of vocal fold structures in structured light laryngoscopy images. To make the point localization robust against specular reflections, a linearized Gaussian-model-based weighted regression is used for post-processing. The authors claim to achieve a processing speed of 926 FPS with a 2.5D U-Net architecture that takes in sequences of 512 x 256 images, with a closed-form solution applied to the regression model for real-time applications. Additionally, the authors use the HLE++ dataset, which extends the HLE dataset with vocal fold segmentation masks, for effective model training, and plan to release the dataset and model publicly pending the review outcome.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The biggest weaknesses of the current manuscript are the lack of algorithmic novelty and of ablation studies. The authors claim to have improved the point localization network by applying a Gaussian regression model based on previous studies [2] and [9] on top of their UNet-based multitask backbone. However, the evidence that applying only the Gaussian regression model improves the segmentation and point localization performance is weak. Additional comparisons to state-of-the-art architectures* and ablation studies on the model are necessary to demonstrate the improvement the Gaussian regression model brings to point localization.
    • It is also necessary to explain why the authors utilized a multitask network structure. Does the segmentation model help with point regression? What performance improvements does the reparameterization of the Gaussian regression model bring compared to a general regression model? Can the point regression model and segmentation model be used independently?
    • Furthermore, it is necessary to justify why this study addresses the segmentation and point localization problems. Do these two pieces of information increase diagnostic accuracy? Or can the proposed model be used to predict other diseases using the segmentation map and point information? The clinical significance of the proposed model also needs to be justified.

    *SOTA architectures

    • Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis, 2021.
    • UNETR: Transformers for 3D Medical Image Segmentation, 2021.
    • Medical Image Segmentation via Cascaded Attention Decoding, 2022.
    • Dual Cross-Attention for Medical Image Segmentation, 2023.
    • Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, 2022.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have decided to make the relevant models and data publicly available, assuming the acceptance of the paper for publication.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) An ablation study is needed to examine the impact of Gaussian regression modeling on multi-task training. 2) Comparison to SOTA models is limited. [21] appears to be poorly trained. 3) Please justify the clinical significance of the proposed research. 4) The technical novelty is lacking. Perhaps a related ablation study could supplement this.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Authors may need to supplement the items listed in number 9 in order to be reconsidered. There have been no relevant experiments conducted to support the strengths of the proposed algorithm. Please also provide justification for clinical significance.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision
    • The authors emphasize the real-time performance of the model for practical application in clinical settings. While the suggestion regarding real-time performance is understandable, there is still a lack of comparative studies on the model, which remains a concern.
    • Given the potential for clinical utility in the diagnosis of human vocal folds and the authors’ rebuttal, the rating is adjusted to weak reject. However, I would like to point out the insufficient comparative research in certain aspects.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a neural model for semantic segmentation and keypoint detection in structured light high-speed laryngoscopy videos. The authors indicate as major contributions (1) the semantic segmentation cast as a single-channel heatmap regression problem, and (2) the prediction of sub-pixel accurate 2D keypoint locations through weighted least squares in real time. The experimental results improve the robustness, accuracy, and computational performance of current human vocal fold reconstruction approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The positive aspects of the paper are: (1) the paper is in general well written and simple to follow; (2) the authors state that it is the first publication targeting segmentation and detection in structured light laryngoscopy via deep learning; (3) the experimental results are promising. In particular, the computational performance is significantly improved with respect to competing approaches (926 FPS, as indicated in Table 1).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The negative aspects of the paper are: (1) the same dataset is used for training and evaluating the performance of the pipeline; performance evaluation on multiple endoscopes would make the paper stronger; (2) an analysis of the generalizability of the pipeline to different diseases would also be important for evaluating the relevance of the proposed work.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper seems to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) It would be interesting to understand how many patients are contained in the HLE dataset. The authors mention 10 labeled in-vivo recordings - does each recording belong to a different person? (2) Is data of the same person contained in both the training and the test datasets, or are the persons separated? (3) Diseases of the vocal folds may alter their structure and motion. The authors do not provide any insight into the generalizability of the algorithm in the presence of diseases. (4) What about the generalizability to different scopes and/or projected patterns?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a joint segmentation and sub-pixel localization pipeline for structured light laryngoscopy. The experimental results suggest good accuracy as well as high computational efficiency (926 FPS). The latter is very important for a practical setting. On the negative side, and hence my “weak accept” rating, the generalizability to different scopes or diseases is not analyzed.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a CNN that quickly estimates both segmentation and structured light location (for 3d reconstruction) in laryngoscopy. Unlike others, it uses a deep learning model to estimate segmentation and structured light point locations with a hybrid of least squares and learned heat map regression. This work enables the possibility of real-time algorithms with the underlying clinical motivation of improving patient feedback.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper proposes a very fast way to segment vocal folds while sharing compute for keypoint estimation. They have a very interesting combination of deep learning and classical methods that allows detection of multiple structured light points from a single channel. Finally, the authors increase algorithm speed by running on temporal sequences of images. They provide a comparison of the segmentation and point estimation to other ML baselines (Table 1).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Explanation of softmax: The softmax in Figure 2 is confusing: are we softmaxing over channels (which makes sense for segmentation) or per channel over the image (which makes sense for a keypoint heatmap)? The two readings are sketched after this list. How are the resulting softmax holes (as in the epiglottis in the middle softmax image) filled to create a segmentation without holes?

    2. 2.5D UNet architecture: Why does a 3D convolution make sense on a reshaped b 1 c h w image? In an initial layer this would make sense, but the spatial contiguity a 3D conv provides makes less sense in the internally reshaped bottleneck layer (seen in supplementary material). This is because the channels have already been mixed here and there is not necessarily spatial contiguity that makes sense for a 3D convolution.

    Likewise for the output layer: going from shape ‘b 1 h 256 512’ to ‘b c n 256 512’ seems like a big bottleneck… each 1 feature channel has to become c classes after a single layer. These could be a confusion of notation, please clarify if so.

    3. Sequence input: Also, how long is the temporal sequence input? I presume it is 5 frames, as that is what is shown in the figures, but I do not see it detailed in the text. Likewise, regarding ‘simultaneously predict segmentations channel and batch wise’: do you mean batch-wise as in sequences of temporal frames? That would not be a batch, but rather a sequence. Is the 5-frame sequence input the reason for the ~5x performance improvement with the 2.5D UNet (926 vs. 220 FPS)? I would otherwise expect the larger UNet to be slower.
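The two softmax readings raised in item 1 can be made concrete with a short PyTorch sketch; the tensor shapes and the choice of channel 0 as the keypoint channel are illustrative assumptions, not the paper's configuration.

```python
import torch

# Two readings of "softmax" for a (B, C, H, W) network output:
logits = torch.randn(2, 4, 256, 512)

# (a) Softmax over the channel axis: every pixel gets a distribution over the
#     C classes -- the usual choice for semantic segmentation.
seg_probs = torch.softmax(logits, dim=1)  # sums to 1 across channels per pixel

# (b) Spatial softmax within one channel: the keypoint channel becomes a
#     distribution over pixel locations -- the usual choice for heatmap regression.
kp_channel = logits[:, 0]  # (B, H, W), assumed keypoint channel
kp_probs = torch.softmax(kp_channel.flatten(1), dim=1).view_as(kp_channel)
```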
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Once the model confusions I raise above are clarified, I believe the authors provide an in-depth description of their model and training paradigm. Thus I believe reproducibility and reimplementation of this work would be very doable.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    A couple text confusions: Pg. 5: What does it mean to have a ‘box kernel with a 0 set center’?

    Pg. 6: By ‘regularize with a differentiable point-based distance metric’, do you mean train? This looks like a training loss objective.

    Pg. 7: ‘linearly interpolate between the first and last frame’. What does it mean to linearly interpolate a binary mask?

    Pg. 8: I believe you mean ‘reject’ instead of ‘regard’.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has a very sound idea and implementation, which is novel for this field in addition to being fast enough for practical application. They bolster their paper even more with relevant figures and a motivating narrative. The only weaknesses to me are the motivations for the specific design of the 2.5D network along with the other text confusions I mention in weaknesses. I believe with clarification, these will be amended and make this an interesting and clear contribution suitable for MICCAI.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Although my opinion on the paper has not changed (in that it should be accepted, but the model could be improved), I believe my concerns from before still stand: “2. 2.5D UNet architecture: Why does a 3D convolution make sense on a reshaped b 1 c h w image? In an initial layer this would make sense, but the spatial contiguity a 3D conv provides makes less sense in the internally reshaped bottleneck layer (seen in supplementary material). This is because the channels have already been mixed here and there is not necessarily spatial contiguity that makes sense for a 3D convolution.

    Likewise for the output layer: going from shape ‘b 1 h 256 512’ to ‘b c n 256 512’ seems like a big bottleneck… each 1 feature channel has to become c classes after a single layer. These could be a confusion of notation, please clarify if so.”

    A 3D convolution on latent reshaped data does not make sense in principle (even if it does give good results, a 2D convolution should be able to do the same in this case). We are not asking: “(R4) Why does a 3D convolution make sense?”, but instead: why does a 3D convolution make sense on 2D data (NCHW->reshape->N1CHW->3dconv). The initial 3D conv does make sense, but the internal one does not. I believe this is actually just a sparse weight matrix subset of a 2d convolution in this case.

    Similarly, the 1-channel bottleneck motivation is unclear.

    Our concerns (1, 3) do not seem to be addressed, and we would like that to be done in the paper. That said, the results stand up on their own still, and my decision remains accept.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a method to simultaneously segment and track keypoints in the vocal fold anatomy from structured light laryngoscopy. The main contribution focuses on demonstrating that this algorithm is significantly faster to compute when compared to alternatives.

    Strengths:

    • The achieved computational efficiency is relevant
    • Simultaneous keypoint detection and segmentation is an interesting problem with much to explore

    Weaknesses:

    • Some reviewers request better detail on the clinical motivation for this work, and ask whether it can be generalised to more clinical applications
    • A reviewer states that there is not a sufficient experimental ablation to fully justify this method.

    Given the mixed reviews, I recommend this paper for rebuttal. I highlight that significantly new experiments are out of scope for a MICCAI rebuttal; nonetheless, the authors should still comment on and try to defend the significance of the current experiments with respect to the lack of ablations. Please also note the reviewers’ comments on symbol notation in the architecture description.




Author Feedback

We thank the reviewers for their insightful feedback. We appreciate that they found our submission to be “a very novel and sound idea” and that they consider it an “interesting combination of deep learning and classical methods”. In the following, we address the main concerns.

(R2, R3, R4) Regarding the ablation study and comparisons. We agree that an ablation study of the Gaussian regression would enhance the manuscript. Our initial design ideas involved the moment method and Caruana’s algorithm (A), the Newton-Raphson method (B), fully-connected layers operating on small image patches (C) and simply extracting pixel-wise local maxima (D). Methods (A) are prone to outliers, method (B) is iterative, method (C) is data-driven, and method (D) has obvious quantization errors. We chose a classical method that is robust to outliers, as we believe algorithms with well-understood behavior are crucial in the medical domain. The estimated sigma also enables a first estimate of surface normals that are valuable for 3D reconstruction. We tested Guo’s method on facial keypoint detection networks and observed sub-pixel improvements.
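For reference, below is a minimal NumPy sketch of a closed-form weighted least squares Gaussian fit in the spirit of Guo's method mentioned here, reduced to 1D for brevity; the function interface, the y² weighting, and the 1D simplification are assumptions of this sketch and may differ from the authors' exact 2D formulation.

```python
import numpy as np

def fit_gaussian_1d(x, y):
    """Closed-form weighted least squares fit of a 1D Gaussian (Guo-style).

    Fits y ~ A * exp(-(x - mu)^2 / (2 * sigma^2)) by taking the log, which
    turns the model into a parabola ln(y) = a + b*x + c*x^2, and solving a
    weighted least squares problem with weights y^2 so that low-intensity
    (noisy) pixels contribute little.  Returns (mu, sigma, A).
    """
    y = np.clip(y, 1e-8, None)                 # avoid log(0)
    w = y ** 2                                  # weights
    X = np.stack([np.ones_like(x), x, x ** 2], axis=1)
    # Weighted normal equations: (X^T W X) p = X^T W ln(y)
    A_mat = X.T @ (w[:, None] * X)
    rhs = X.T @ (w * np.log(y))
    a, b, c = np.linalg.solve(A_mat, rhs)
    mu = -b / (2.0 * c)
    sigma = np.sqrt(-1.0 / (2.0 * c))
    amp = np.exp(a - b ** 2 / (4.0 * c))
    return mu, sigma, amp
```

Because the solution is a small closed-form linear solve per dot, the refinement adds negligible cost and stays fully differentiable, which matches the motivation given in the abstract.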

Comparisons with different model architectures would be interesting, but we must reiterate that our approach is model independent. Thus we focused on showing the viability of the method rather than comparing different model architectures.

(R2, R3) Clinical relevance? Döllinger et al. highlighted the importance of vertical deformation analysis in human vocal folds. However, systems currently used by clinicians are incapable of measuring these deformations. Our method not only plays a crucial role in vocal fold 3D reconstruction pipelines but also enables analysis of general 2D vocal fold features. While predicting future developments is challenging, we believe that 3D reconstruction of human vocal folds will have clinical relevance. While we do not know of any other medical use cases requiring >1000Hz, there are tasks that require points to be inside specific image regions. These may benefit from our approach. E.g. estimating and tracking human hands of Parkinson’s disease patients necessitates restricting keypoints to the hand region.

(R2) Why do you use a multitask network? The reasons are three-fold: a) employing a multitask network enables us to halve the inference time, which is vital in tasks where efficiency is key; b) general keypoint detection networks would exceed maintainable VRAM; c) we believe that incorporating a segmentation task enhances the robustness of point localization both implicitly and explicitly. We acknowledge that further tests on this matter would provide valuable insights.
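As a toy illustration of this shared-computation argument (not the paper's actual architecture), the sketch below shows a single network pass whose output channels serve both tasks, with one extra channel treated as the laser-dot heatmap; layer sizes and the channel split are assumptions.

```python
import torch
import torch.nn as nn

# One forward pass, two tasks: the first `num_classes` channels are treated as
# segmentation logits, the last channel as the keypoint heatmap, so all of the
# expensive computation is shared between the two outputs.
num_classes, feat = 3, 32
net = nn.Sequential(
    nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
    nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
    nn.Conv2d(feat, num_classes + 1, 1),
)

x = torch.randn(2, 1, 256, 512)                      # batch of grayscale frames
out = net(x)                                         # (B, num_classes + 1, H, W)
seg_logits, heatmap = out[:, :num_classes], out[:, num_classes:]
```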

(R3) How is the dataset structured and how is it used in training? HLE contains data of 10 healthy subjects; each video contains one individual. We evaluate our method on subjects not included in the training set. We agree that more subjects (including non-healthy ones) as well as more endoscopes would support this manuscript. However, these systems are still in active development and the data is hard to generate.

(R4) Why does a 3D convolution make sense? Bottleneck: In our tests, the network learned to utilize this to weight temporal information and interpolate between frames. It also reduces the severity of the sequence jumps that can be seen in the supplementary material. Output: While this is indeed a bottleneck, the number of classes is small in our case. More research regarding 2D/3D network combinations for improved inference speed (in return for higher latency) would be interesting and might remove these jumps completely. This would certainly not work for problems needing more than a few classes. Finally, we want to stress that we believe our main contribution lies in the reformulation of the keypoint detection problem.
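For readers following this exchange, here is a small PyTorch sketch of the reshape-then-3D-convolution pattern under discussion (2D convolution over stacked frames versus 3D convolution after inserting a singleton channel axis); the frame count, resolution, and channel widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

B, T, H, W = 2, 5, 256, 512
frames = torch.randn(B, T, H, W)             # (batch, time, height, width)

# (a) Treat the T frames as input channels of a 2D convolution.
conv2d = nn.Conv2d(T, 32, kernel_size=3, padding=1)
feat2d = conv2d(frames)                       # (B, 32, H, W): time folded into channels

# (b) Reinterpret time as a depth axis and use a 3D convolution, so features
#     are produced per frame while mixing information across adjacent frames.
vol = frames.unsqueeze(1)                     # (B, 1, T, H, W)
conv3d = nn.Conv3d(1, 32, kernel_size=3, padding=1)
feat3d = conv3d(vol)                          # (B, 32, T, H, W)
```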

Lastly, we want to thank everyone for this valuable feedback and hope we could clarify the uncertainties. Minor clarifications regarding notations will be fixed in the manuscript.

The authors




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has partially addressed the reviewers’ concerns, specifically those concerning the motivation of the work. The lack of an ablation experiment is still problematic for one reviewer, and a few technical details about the architecture remain unclear (R3).

    I believe the key deciding factor here is whether we consider the missing ablations necessary. I acknowledge there may be diverging opinions here. Overall, I am leaning towards accepting this paper:

    • The ablation on the Gaussian regression would definitely be interesting, as the authors also acknowledge, but I don’t think this invalidates the overall contribution here, which is the simultaneous segmentation + keypoint idea. If there were SoTA methods following a similar logic, then such an ablation would be more fundamental, but as it stands, the paper still validates the core idea of doing segmentation + keypoint detection, which I believe is of sufficient interest to MICCAI
    • For the ablation of more backbone architectures, I agree with the authors that this would add little additional information to the paper, given that the core idea is transversal to the backbone architecture.
    • The idea/application combination is novel enough to justify interesting discussion in a MICCAI conference, even if there could be hypothetical architecture adjustments that could further improve this



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This is a novel study focused on jointly segmenting and detecting key points in the vocal folds. The authors were very meticulous in their manuscript and rebuttal. Based on the clinical utility and novel approach, I recommend acceptance of this manuscript.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I read the paper, and I think it is a good paper with a good contribution. Also, the authors provided a good rebuttal that answers all the minor concerns with the manuscript.


