
Authors

Vincent Bürgin, Raphael Prevost, Marijn F. Stollenga

Abstract

Automatic vertebra localization and identification in CT scans is important for numerous clinical applications. Much progress has been made on this topic, but it mostly targets positional localization of vertebrae, ignoring their orientation. Additionally, most methods employ heuristics in their pipeline that can be sensitive in real clinical images which tend to contain abnormalities. We introduce a simple pipeline that employs a standard prediction with a U-Net, followed by a single graph neural network to associate and classify vertebrae with full orientation. To test our method, we introduce a new vertebra dataset that also contains pedicle detections that are associated with vertebra bodies, creating a more challenging landmark prediction, association and classification task. Our method is able to accurately associate the correct body and pedicle landmarks, ignore false positives and classify vertebrae in a simple, fully trainable pipeline avoiding application-specific heuristics. We show our method outperforms traditional approaches such as Hungarian Matching and Hidden Markov Models. We also show competitive performance on the standard VerSe challenge body identification task.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_46

SharedIt: https://rdcu.be/dnwPr

Link to the code repository

https://github.com/ImFusionGmbH/VID-vertebra-identification-dataset

Link to the dataset(s)

https://github.com/ImFusionGmbH/VID-vertebra-identification-dataset

https://osf.io/nqjyw/


Reviews

Review #4

  • Please describe the contribution of the paper

The authors propose a method for automatic vertebra localisation from CT scans; the method uses a combination of U-Net and graph neural networks to do so. The authors also introduce a novel dataset that contains more challenging cases and show that classical heuristics struggle to produce good results on such a challenging dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is clearly articulated, the method shows promise. The authors detail well every part of the model as well as the justifications.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1- Figure 2 is not clear enough: do the examples shown in Figure 2 come from the actual data, or are they just a representation of a hypothetical result? 2- L_{edge} and L_{node} are not detailed enough; they should be written out fully.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The method seems easy to reproduce; the authors provide enough details to do so.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

It would be better to put the best results in bold in the tables and to sort results from worst to best to make them easier to read. Also, the losses should be detailed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very clear, it solves issues that are overlooked and introduces a new dataset, the authors use a simple yet efficient combination of UNet and GNN therefore bridging the gap between both worlds.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

The authors propose a pipeline for automatic vertebral localization, identification, and keypoint association for both the vertebral body and pedicles in CT scans, using a heatmap-predicting U-Net followed by a message-passing graph neural network. The authors also introduce a new dataset that adds pedicle location labels to vertebrae of the VerSe, CT Colonography, and CT Pancreas datasets. The method outperforms traditional approaches such as Hungarian Matching and Hidden Markov Models in terms of identification rate and edge F1 score, and is competitive with top VerSe challenge entries. The authors mention the pipeline is generalizable and applicable to other anatomy, but do not specify the future direction of their work.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Methods and comparison: Methods are well explained, but technical novelty is questionable. Authors do not specify if they utilize an adaptation of keypoint legitimacy prediction for their task; it appears to be the default legitimacy prediction algorithm based on their text. The authors put in significant effort to compare a number of architectural variations of their developed pipeline. Further, the comparison of their results to VerSe 2019 scores is also encouraging. Traditional methods are included as baselines to better frame the overall results for each task. Much of the inference studies speak to the robustness of their method, even if it is not fully novel. Ease of reproducibility: the method pipeline of UNet2 and a GNN can be applied to other test datasets (e.g. digital hand atlas database). Although not officially released, the labeled dataset would be a valuable contribution to the community. Clarity: The paper is well written and is easy to follow. It is organized nicely and is nicely formatted, especially the figures/tables.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Limited detail in comparisons: While the authors discuss architectural comparisons, the reports are more broad than deep and do not offer much insight compared to the state of the art. Metrics such as outlier detection frequency and time to make predictions would be beneficial. Further, an evaluation of the cited top VerSe challenge scorers' networks should be run on the private dataset - much of their code is open-source and available. The authors do not mention the hardware used, training time, or inference time. There is no statistical evaluation of results, and paired t-tests would provide statistical weight to the argument that the network is (or is not) a significant improvement over other networks.

Loss of focus with regard to the initial goal: the authors mention that the orientation and direction of vertebrae are often ignored and that this problem should be addressed; however, the results do not directly demonstrate any correspondence between the detected keypoints and the orientation of the vertebrae themselves.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The methods within appear highly reproducible, especially for the work on the VerSe dataset; however, much of the overall reproducibility of the study will depend on the release of the authors' dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Discuss how orientation and direction of the vertebrae are easily distinguished given successful keypoint detection.

    Clearly explain technical novelty of the method, especially regarding the adaptation of keypoint legitimacy prediction for their task.

Provide more detailed comparisons with other state-of-the-art methods, including training of these methods on the 2118-scan spine dataset. Afterward, apply a statistical evaluation of the results.

    Specify hardware used, training time, and inference time to give readers a better understanding of the feasibility and scalability of the method.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The submission does not speak much to novelty and could use some comparative improvement, but is generally a well written and nicely executed study with a promising dataset. The authors should have the opportunity to provide their rebuttal for the reviews.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

The authors propose a 2-stage approach for vertebra localization, segment classification, and pedicle landmark association. In stage 1, a U-Net is used to detect keypoints classified as left/right pedicle or body. In stage 2, the clustered keypoints are input to a GNN for node and edge classification. The output of the GNN is the spine-level classification and keypoint legitimacy (via node classification) and the body-pedicle association (via edge prediction). Annotations were created for an extended CT dataset (pedicle locations + vertebra-level labeling).
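
    As a rough illustration of the second stage (not the authors' actual architecture; all names, feature layouts, and weights here are hypothetical and untrained), one round of mean-aggregation message passing over a k-NN keypoint graph, followed by an edge-scoring head, might look like this:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy keypoint features: (z-position, one-hot class body/left/right pedicle).
    # Values are illustrative only, not taken from the paper's data.
    feats = np.array([
        [0.10, 1, 0, 0],   # body 0
        [0.12, 0, 1, 0],   # left pedicle 0
        [0.11, 0, 0, 1],   # right pedicle 0
        [0.30, 1, 0, 0],   # body 1
    ])

    def knn_edges(pos, k=2):
        """Directed k-NN edges on the position coordinate."""
        edges = []
        for i in range(len(pos)):
            d = np.abs(pos - pos[i])
            d[i] = np.inf
            edges += [(i, int(j)) for j in np.argsort(d)[:k]]
        return edges

    edges = knn_edges(feats[:, 0])

    # One round of mean-aggregation message passing. In a real GNN these
    # weight matrices are learned; here they are random, so only the data
    # flow and the shapes are meaningful.
    W_msg = rng.normal(size=(4, 8))
    W_upd = rng.normal(size=(12, 8))

    inbox = {i: [] for i in range(len(feats))}
    for i, j in edges:
        inbox[j].append(feats[i] @ W_msg)

    h = np.stack([
        np.tanh(np.concatenate([
            feats[i],
            np.mean(inbox[i], axis=0) if inbox[i] else np.zeros(8),
        ]) @ W_upd)
        for i in range(len(feats))
    ])

    # Edge scores from concatenated endpoint embeddings (edge-classification
    # head); a trained model would threshold these to associate keypoints.
    W_edge = rng.normal(size=(16,))
    scores = {e: float(np.concatenate([h[e[0]], h[e[1]]]) @ W_edge) for e in edges}
    print(h.shape, len(scores))  # (4, 8) 8
    ```

    A trained version of such a model would additionally carry a node-classification head for the spine-level and legitimacy outputs described above.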

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method ignores false positives and successfully recognizes the corresponding vertebra body centroid and pedicle. It outperforms Hungarian matching and HMM. The annotated dataset will be made publicly available. Implementation of a bootstrapping strategy for automatic pedicle keypoint, vertebra body keypoint, and vertebra-level label annotation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Comparison to baseline approaches: Is the graph approach necessary? What is the advantage compared to multi-label vertebra segmentation? This can directly solve the left/right pedicle assignment problem. Limited evaluation: no significance tests, standard deviations not reported, and qualitative results / discussion of failure cases are missing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

It is possible to reproduce the proposed approach if the data is available. Dataset + annotations will be published. Some details remain unclear: The initial pedicle keypoint annotation process via postprocessing of the segmentation mask is not clear. How are the spinal segment pseudo-probabilities for the node feature generated? Does the heatmap network distinguish 3 different keypoint classes (body, left, right pedicle)? How many images does the hard subset contain? The related work summary could be improved; currently it is only listed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • nnUNet multi-label vertebra segmentation and subsequent connected-component analysis for vertebra body keypoint extraction seems like a suitable baseline approach that is not evaluated. Do false positive body keypoints also occur with nnUNet multi-label vertebra segmentation? The multi-label vertebra mask could be employed for non-maxima suppression of wrongly detected pedicle keypoints; the spine-level assignment could also be derived from the masks
    • It is stated that, in contrast to other approaches, the vertebra orientation is derived; however, the pedicle keypoints alone do not fully describe the orientation, and inclusion of the endplates would be required as well
    • Were the baseline methods retrained on the extended dataset? It is unclear whether the performance difference results from the methodological differences or from the different training data
    • Table captions need more details: which task is evaluated, and which keypoints are evaluated by each metric?
    • Legitimacy prediction is disabled because the identification rate is largely unaffected by false positives; however, Tab. 3 shows the opposite
    • Results, Tab. 1: the claim “Our method outperforms the baseline methods” is not true for edge accuracy in joint mode; standard deviations and statistical tests are not reported
    • Discussion: why is d_mean significantly lower for the proposed method? Is the comparison performed on the same set of vertebrae, or does the number depend on the identification rate?
    • Why do the architectures differ between the two evaluation tasks (Tab. 1 and 3)?
    • Edge/node features: why is the spatial information of the keypoints not used as input for the GNN, and only represented via k-NN and distance?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The dataset annotation is a valuable contribution. There is methodological novelty, but a comparison to the simpler direct approach via multi-label vertebra segmentation is missing. It is difficult to follow the description of the method and experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes an algorithm for vertebra localisation from CT scans, combining a U-Net for heatmap regression and a graph neural network for identifying vertebrae. Evaluations are performed on VerSe and additional data, with extended annotations which will be made available upon acceptance. Reviewers agree in favor of the paper due to its small but overall meaningful contribution and good evaluation protocol. A number of minor issues are mentioned by the reviewers, especially clarifications of details like training/inference times, statistical significance tests, and presentation issues. It is recommended to discuss minor issues in a rebuttal and to put additional details into the supplementary material. However, following MICCAI rules, it is not required to perform additional experiments.




Author Feedback

We thank the reviewers kindly for their very detailed and constructive comments that will help improve the final version of the paper. We are especially glad that the clarity of the paper is well received.

R2 and R3 ask about our definition of ‘vertebra orientation’, which may indeed be ambiguous. We chose the pedicle-body plane as our orientation plane for being the most stable and we will make this more clear in the text. Using end-plates as suggested by R2 would also have been an option but would lead to more issues when end-plates are not aligned or flat (which happens quite regularly in our pathological datasets).

R2 and R3 ask us to test the statistical significance of our results. We agree that this would improve the work, and will provide p-values for a paired difference test between our method and the baselines in the final version.

Both reviewers also wondered whether there is a 'simpler' method to compare against, e.g. a U-Net that directly segments and identifies the vertebrae. Our experience is that training segmentation networks with that many output classes is not that simple, sometimes even failing to converge to a reasonable solution (e.g. with a collapse of some of the output channels). Evaluating against such networks did not seem a fair comparison.

Our approach is the result of significant experimentation, also with datasets of real surgery cases (from private sources that we could unfortunately not use for this paper) on which simple approaches regularly fail at segmenting or identifying some vertebrae. Even in some public datasets and models like TotalSegmentator, we noticed such issues. We settled on the comparisons we have in the paper:

  1. A comparison on the ‘vertebra orientation’ dataset with 2118 images we introduced that contains annotations for pedicles. We compare our method to the heuristic method (HMM + Hungarian matching) which we have previously used to solve this task.
  2. A comparison on the established VerSe dataset; a simpler task (body only) on which many methods have been optimized, since it was an official challenge. Here we are able to show our method produces competitive results (while not SOTA on this dataset) without being purely optimized to this dataset.
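
For readers unfamiliar with the assignment step of this heuristic baseline, it amounts to a minimal-cost one-to-one matching between detections. A minimal sketch with hypothetical toy coordinates (brute force over permutations here, which finds the same optimum the Hungarian algorithm computes in O(n^3) for realistic sizes):

```python
from itertools import permutations

# Toy z-coordinates of detected vertebra bodies and left pedicles
# (illustrative values, not from the paper's data).
bodies = [100.0, 130.0, 160.0]
pedicles = [131.0, 99.0, 161.0]

def optimal_assignment(a, b):
    """Minimal-total-distance one-to-one matching between two detection lists."""
    best = min(permutations(range(len(b))),
               key=lambda p: sum(abs(a[i] - b[j]) for i, j in enumerate(p)))
    return list(enumerate(best))

print(optimal_assignment(bodies, pedicles))  # [(0, 1), (1, 0), (2, 2)]
```

Note that such a matcher has no notion of anatomical plausibility; false positive detections simply get matched to whatever minimizes the total cost, which is one reason the learned association can be more robust.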

Tab. 3 shows the identification rate to be slightly worse with legitimacy predictions turned on, and R2 wonders how this relates to false positives. There are two factors at play here: while false positives could play a role, the result can more likely be attributed to the network's vertebra level prediction not training as well because it simultaneously tries to optimize legitimacy predictions.

Furthermore, we will use your comments (and the added half page) to clarify the paper, most notably:

  • Clarify the descriptions of the tables, highlight the best results, and clarify the statement ‘our method outperforms the baseline method’.

  • Clarify the method description to avoid any confusion on the legitimacy prediction.

  • Clarify that the qualitative examples in Fig.2 are modified by augmentations before running the GNN and heuristic methods, to illustrate several observed failure types (missing/duplicate body/pedicle detections) in one figure.

  • We will write out the losses.

  • The hardware used and training/inference times, currently described in the supplementary, will go to the main paper if there is space.

  • Discuss failure cases, the most frequent being when the predicted spine segment is off by one level (e.g. the last thoracic vertebra already detected as a lumbar level), and the GNN hence predicts several vertebra levels off-by-one.

  • Clarify the size of the hard subset of the 2118 dataset’s validation set (which is 47).

  • Clarify that the heatmap network outputs 3 channels for body/left/right, and that the pseudo-probabilities come from applying the sigmoid function to each channel individually.
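
Concretely, the per-channel pseudo-probability computation described in that last point is just an element-wise sigmoid (toy logits for a single voxel, not values from the paper):

```python
import math

# Raw heatmap logits for one voxel; channels = (body, left pedicle, right pedicle).
logits = [2.0, -1.0, 0.5]

# Sigmoid applied per channel independently; unlike a softmax, the resulting
# pseudo-probabilities need not sum to 1, since the classes are not exclusive.
probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
print([round(p, 3) for p in probs])  # [0.881, 0.269, 0.622]
```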


