
Authors

Jann-Ole Henningson, Marc Stamminger, Michael Döllinger, Marion Semmler

Abstract

Conventional video endoscopy and high-speed video endoscopy of the human larynx solely provide practitioners with information about the two-dimensional lateral and longitudinal deformation of vocal folds. However, experiments have shown that vibrating human vocal folds have a significant vertical component. Based upon an endoscopic laser projection unit (LPU) connected to a high-speed camera, we propose a fully-automatic and real-time capable approach for the robust 3D reconstruction of human vocal folds. We achieve this by estimating laser ray correspondences by taking epipolar constraints of the LPU into account. Unlike previous approaches only reconstructing the superior area of the vocal folds, our pipeline is based on a parametric reinterpretation of the M5 vocal fold model as a tensor product surface. Not only are we able to generate visually authentic deformations of a dense vibrating vocal fold model, but we are also able to easily generate metric measurements of points of interest on the reconstructed surfaces. Furthermore, we drastically lower the effort needed for visualizing and measuring the dynamics of the human laryngeal area during phonation. Additionally, we publish the first publicly available labeled in-vivo dataset of laser-based high-speed laryngoscopy videos. The source code and dataset are available at https://henningson.github.io/Vocal3D/.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_1

SharedIt: https://rdcu.be/cVRUH

Link to the code repository

https://github.com/Henningson/Vocal3D

Link to the dataset(s)

https://github.com/Henningson/HLEDataset


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a method for extracting a 3D mesh of the vocal folds using laser endoscopy as well as a dataset of such laser endoscopy images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The dataset is an important contribution for anybody working in vocal fold reconstruction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The literature review is very weak. In particular, it fails to explain how the method differs from the other laser endoscopy based methods [13,16,20,21], especially [21]. The algorithm description is not clear; for example, it is not clear where the first two steps, MS (mask sweeping) and GA (global alignment), were described, since they were not named so in Fig 1 or in Sections 2.1 and 2.2. The dataset is quite small, containing only 10 videos. The 21 phantom videos should also be included in the dataset. The quantitative evaluation is lacking in many respects. First, it is not clear what kind of labeling is evaluated. Second, a quantitative comparison with other state-of-the-art methods such as [21] is missing.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is reproducible with some effort.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Improve the lit review to show how your method is different from existing methods, especially [21]. Crystalize your naming of the different parts of the method so that they are described the same in Fig 1, in their text descriptions and in Table 1. Add a quantitative comparison with [21]. Explain what labeling error is being evaluated in Table 1.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Weak literature review and weak quantitative evaluation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The motivation for the study is the examination of vocal folds for the diagnosis of laryngeal and voice related disorders. The study proposes a new framework for real-time (~25 fps) reconstruction of the 3D geometry of (vibrating) human vocal folds. The framework uses a laser projection unit (LPU) connected to a high-speed camera to acquire information about the vocal fold geometry, and a well-established method, a parametric reinterpretation of the M5 vocal fold model as a tensor product surface, for geometry reconstruction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed framework appears to be clinically feasible: (a) The images are acquired using a clinical grade endoscopic projection unit (and high-speed camera); (b) The framework is automated and consistent with the clinical workflow time constraints; (c) Real-time (~25 fps) performance is achieved using off-the-shelf computing hardware (i7 CPU and NVidia Quadro RTX 4000 GPU).

    Surface reconstruction: appears to be an innovative extension and application of well-established methods (M5 Model to 3D using B-splines)

    Reasonably extensive evaluation of the proposed framework using a physical (silicone) model of the human vocal folds and established labelled image datasets is reported.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In qualitative evaluation, L1 error and standard deviation of grid offsets are used (Table 1). The following criticism can be raised here: (a) Justification for using these particular error measures is not provided;

    (b) It is unclear how relevant the error measures used are given the context of the study (diagnosis of laryngeal and voice related disorders);

    (c) Are the reported accuracy and robustness of the proposed framework sufficient (in a quantitative sense) for application in the diagnosis of laryngeal and voice related disorders (which is the motivation for the study)?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Appears to be satisfactory. However, given the statement in the manuscript that “we publish a dataset containing laser-endoscopy videos of 10 healthy subjects that can be used to drive further research in this area”, the reference (or web link) to this publication will need to be provided in case the manuscript is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is a very well written manuscript presenting a scientifically sound (and potentially amenable to clinical workflow) framework/pipeline for real-time 3D reconstruction of vocal fold geometry. My main reservation regarding the proposed framework is that while the reported results seem sufficient to support the conclusion that the obtained 3D reconstruction is visually appealing and can be provided in real-time, they do not appear to satisfy the requirement for quantitative accuracy and robustness that would be required for clinical application in the diagnosis of laryngeal and voice related disorders. Providing justification for the error measures used in the study, and an interpretation of the results in the context of the accuracy required for clinical applications, would improve the manuscript.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a very well written manuscript presenting a scientifically sound (and potentially amenable to clinical workflow) framework/pipeline for real-time 3D reconstruction of vocal fold geometry. It appears that the weaknesses in the interpretation of the results, in terms of the accuracy and robustness that would be required for clinical application, can be addressed by revising the Results and/or Conclusions section without any need for obtaining additional results.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #5

  • Please describe the contribution of the paper

    This paper presents a structured light-based method to reconstruct human vocal folds. A symmetric laser grid pattern is projected on the surface of interest, and the 3D locations of the projected dots are estimated after localizing them in the endoscopic image and establishing correspondence between the camera and the projector. Using a parametric model of the vocal folds and the estimated 3D locations of the projected dots, the authors obtain a dense reconstruction of the projected surface. The major contributions are:

    1. An automated method for dense reconstruction of the human vocal folds using a monocular laparoscopic camera system augmented with a laser dot projector.
    2. The quality of the reconstruction is compared to the state-of-the-art using in-vivo datasets. The dataset will be released upon acceptance.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A fully automatic method is presented for dense reconstruction of human vocal folds using a structured light endoscopic system, and a parametric model of the vocal folds. The entire pipeline runs in 25fps, which seems adequate for the intended application.
    2. The dataset used in this work will be made publicly available which will encourage further development in the field.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The reconstruction pipeline is very similar to what’s presented in the reference [20]. The authors do not clearly distinguish their contributions in contrast to reference [20].
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. Adequate details of the algorithms are provided with references where further details can be found. In addition, the source-code and the dataset will be made publicly available upon acceptance.
    2. No information on the sensitivity of the parameters used in the reconstruction method is reported. Several important implementation details are missing as well.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Overall, the paper reads well: adequate background to the problem is provided with reference to the state-of-the-art methods, methods and the results are presented well.
    2. The described reconstruction pipeline is very similar to the one presented in reference [20]. Without explicitly describing how this paper differs from [20], the authors contributions are difficult to identify.
    3. The authors assume a calibration between the camera and the projector. How is this calibration estimated? How are the camera intrinsic parameters estimated? How good are the estimates? This information is crucial to the reproducibility of the paper.
    4. The authors use epipolar lines to constrain the correspondence search. If a centroid of a projected dot hits an epipolar line, it is considered a potential match. However, with errors in calibration, the dots do not, in practice, exactly hit the epipolar lines. Therefore, some distance measure (between the centroid and the epipolar line) with a threshold has to be considered. What distance did you use?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented methods have limited novelty. However, the authors have validated their method on in-vivo human data. In addition, the dataset will be made publicly available encouraging further research in the field. Considering all these factors, I would accept this paper if my concerns listed above could be adequately addressed.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    While the paper relies on several existing algorithms (such as structured light imaging), it outlines a very concrete application and develops a system for performing vocal fold reconstruction. Structured light type approaches have been used for 3D reconstruction in various fields including computer vision, graphics, stereophotogrammetry etc. Please add several references to structured light (e.g., Veeraraghavan et al.). The reviewers do bring up the point that the literature review is weak.

    All reviewers recognized the clinical application of this paper. Additionally the authors also propose a vocal fold dataset, which will be an important step in the field.

    To summarize, the application is novel, the paper makes clever use of the methods (although details are missing at times) and builds a sound application, and also presents validation results. The authors are suggested to carefully note all the reviewer comments.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We thank the reviewers for their insightful feedback on our manuscript. We appreciate that reviewers found our submission to present a “scientifically sound (and potentially amenable to clinical workflow) framework” and that they consider it as “an innovative extension and application of well-established methods”. Going forward, we’re going to address the concerns.

(R1, R2) Add quantitative comparison with [21]. The ground-truth labels in the dataset were labeled manually using the method of [21]. Thus, our evaluation in Table 1 is a direct quantitative comparison. The goal of our approach is to achieve the same results as [21] automatically and reliably, which we show in our evaluation. We will highlight this in a final version.

(R1, R3) How does our method differ from [13, 16, 20, 21]? [13] and [20] both require manual labeling, where every laser dot has to be selected and labeled by hand. In [16] the authors use a line pattern instead of a grid pattern. While the images in all of these methods were pre-processed for better discernibility, they were ultimately labeled by hand. [21] proposes the first semi-automatic approach for laser dot labeling. Their system provides labeling predictions, but still needs considerable manual input in a post-processing step. Our framework, however, is the first fully-automatic one (to the best of our knowledge). The pipeline receives a high-speed video and instantly outputs a dynamic 3D reconstruction of the oscillating vocal folds. This instant and automatic feedback raises the applicability of the technology and makes it usable on a larger scale. We understand that we have to highlight this more in a final version.

(R2, R3) Some implementation details are missing. We will do our best to include more detail. Furthermore, we will publish the code with a working example - a link to the public github repository will be included in the final version.

(R1) Can you include the silicone vocal fold data? We will happily include these videos in the dataset as well.

(R1, R2) Why L1-Norm for evaluating the labeling error? Is this clinically viable? Our system uses a grid-based labeling, i.e. we label laser dots (x,y) in pixel coordinates as corresponding to laser (n,m) in grid coordinates. In the case of a 31 by 31 laser grid we need to discern 961 different labels, of which not all are a) of interest and b) visible in the image. If our algorithm estimates (n,m) for every label, whereas (n+1,m) would be correct, the L1-Norm is 1. This gives a general intuition about the algorithm's performance. While our method does not infer a perfect labeling in all cases, every solution RHC proposes lies on the epipolar lines. Therefore, a metric measurement might not always be feasible, but the proportions of the surface reconstruction are kept, making it viable for visual inspection and relative measurements by practitioners.
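
As an illustration of the error measure described above, the following minimal sketch computes the per-dot L1 distance between estimated and reference grid labels. It is not taken from the published code; the function name `l1_label_error` and the sample labels are hypothetical and chosen only to reproduce the "(n,m) instead of (n+1,m)" case.

```python
import numpy as np


def l1_label_error(predicted, reference):
    """Per-dot L1 distance between predicted and reference grid labels (n, m)."""
    predicted = np.asarray(predicted, dtype=int)
    reference = np.asarray(reference, dtype=int)
    # Sum of absolute differences over the two grid coordinates of each dot.
    return np.abs(predicted - reference).sum(axis=1)


# Hypothetical reference labels for three visible laser dots.
reference = np.array([[10, 12], [11, 12], [12, 12]])

# Every dot is labeled (n, m) although (n + 1, m) would be correct.
predicted = reference.copy()
predicted[:, 0] -= 1

errors = l1_label_error(predicted, reference)
mean_error = errors.mean()  # average L1 error over all labeled dots
std_error = errors.std()    # spread of the per-dot offsets
print(mean_error, std_error)
```

A constant shift of the whole grid therefore yields an L1 error of 1 per dot with zero spread, which is why the relative proportions of the reconstruction can survive such a mislabeling.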

(R3) Which distance measure is used for potential dot/epipolar line matches? We generate a rounded trapezoidal mask that is directly dependent on the projected laser dots' size. In the case of the in-vivo dataset, that is the same as computing the Euclidean distance between an estimated dot and its closest point on the epipolar line and filtering these by a threshold.
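
The distance-and-threshold formulation mentioned above can be sketched as follows. This is an assumption-laden illustration, not the mask-based implementation: the epipolar line is assumed to be given in implicit form a*x + b*y + c = 0 in pixel coordinates, and the function names, the sample dots, and the threshold `max_dist` are hypothetical.

```python
import numpy as np


def point_line_distance(points, line):
    """Euclidean distance of 2D points to the line a*x + b*y + c = 0."""
    a, b, c = line
    points = np.asarray(points, dtype=float)
    return np.abs(a * points[:, 0] + b * points[:, 1] + c) / np.hypot(a, b)


def candidates_on_line(points, line, max_dist):
    """Indices of dots whose distance to the epipolar line is within max_dist."""
    return np.flatnonzero(point_line_distance(points, line) <= max_dist)


# Hypothetical horizontal epipolar line y = 100 and three detected dot centroids.
line = (0.0, 1.0, -100.0)
dots = [(50.0, 101.5), (60.0, 120.0), (70.0, 98.0)]

# Hypothetical threshold in pixels; in practice it would depend on the dot size.
idx = candidates_on_line(dots, line, max_dist=3.0)
print(idx)
```

Only the first and third dot lie within 3 pixels of the line, so they remain as potential matches for the corresponding laser ray.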

(R3) How is the calibration estimated? How good are the estimates? That is an excellent question and a point we are going to add to the final version. We use the calibration method proposed in [21], the error measures can be inferred from that paper as well. The calibration data will be included in the dataset.

(R1) MS and GA were not defined in the manuscript. The acronyms were used only in Table 1 as to not exceed the line width limitation given by the template.

Finally, we want to thank the reviewers for their valuable feedback and hope we could clarify some uncertainties/ambiguities.

The authors




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed the reviewers' concerns satisfactorily in their rebuttal. The paper uses existing algorithms; however, their application to human vocal fold reconstruction is novel and is of interest to the medical imaging as well as the clinical community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received mixed reviews, R1 being the most negative one and the other two more positive. The authors made a convincing rebuttal. The paper can be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers agreed that this paper heavily focuses on clinical applications rather than new methodological development. While the proposed pipeline is a combination of previously published work, its clinical application of developing automatic 3D reconstruction of human vocal folds in real-time is new. This research has great potential to open future studies once the collected dataset is made publicly available (as the authors have committed to). The experimental evaluation is somewhat limited since this topic is relatively new. However, the high originality of the proposed research will add a nice contribution to MICCAI.

    Overall, although the score of this paper is not at the top of my batch, I would happily recommend accepting this manuscript if the authors carefully incorporate the reviewers’ comments in their revised version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


