
Authors

Han Wu, Jiadong Zhang, Yu Fang, Zhentao Liu, Nizhuan Wang, Zhiming Cui, Dinggang Shen

Abstract

Accurately localizing and identifying vertebrae from CT images is crucial for various clinical applications. However, most existing efforts operate in 3D on cropped patches, suffering from large computation costs and limited global information. In this paper, we propose a multi-view vertebra localization and identification method for CT images, converting the 3D problem into a 2D localization and identification task on different views. Without the limitation of the 3D cropped patch, our method can naturally learn multi-view global information. Moreover, to better capture the anatomical structure information from different view perspectives, a multi-view contrastive learning strategy is developed to pre-train the backbone. Additionally, we propose a Sequence Loss to maintain the sequential structure embedded along the vertebrae. Evaluation results demonstrate that, with only two 2D networks, our method can localize and identify vertebrae in CT images accurately, and it consistently outperforms state-of-the-art methods.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_14

SharedIt: https://rdcu.be/dnwGR

Link to the code repository

https://github.com/ShanghaiTech-IMPACT/Multi-View-Vertebra-Localization-and-Identification-from-CT-Images

Link to the dataset(s)

https://s3.bonescreen.de/public/VerSe-complete/dataset-verse19training.zip

https://s3.bonescreen.de/public/VerSe-complete/dataset-verse19validation.zip

https://s3.bonescreen.de/public/VerSe-complete/dataset-verse19test.zip


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to locate and identify vertebrae in CT images by first using 2D projections to locate and identify the vertebrae, and then reconstructing the results in 3D. A claimed contribution is the use of a sequential loss, which is very similar to the graph approach used in [13].

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The 2D projection of the images and the 3D reconstruction of the results have interesting properties, as shown by the ablation study in Tab. 2.

    The contrastive learning strategy on multiple views is interesting for pretraining and is validated numerically.

    The results obtained on VerSe19 outperform the state of the art.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no mention of a validation set for hyperparameter tuning. How were the hyperparameters adjusted? Choosing them to optimize the test set results would invalidate the current numbers.

    Comparison with [13] is missing:

    • The proposed sequential loss is very similar to the graph strategy proposed in [13]. The authors should fix the citation of [13] and improve its discussion (detailed feedback provided).
    • A numerical comparison to [13] is missing. The authors should compare on the VerSe20 dataset, which includes part of VerSe19.

    The analysis and discussion of the results on transitional vertebrae is missing (occurrence of T13 and L6, or absence of T12). The VerSe dataset was created with the goal of accounting for these challenging cases: how does the method deal with them, and how does it perform on them?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    While the authors made the effort to provide “Implementation Details” (Sec 3.2 could go to the Sup. Mat.), it would be hard to reproduce the results without the code:

    • the contrastive learning has several hyperparameters not detailed
    • how the views are rendered is not precisely described
    • the implementation of the sequential loss would not be straightforward
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Page 2 line 1 is not correct: [13] does not use a graph neural network; their identification method is global and does not “destroy” the global information. This should be fixed and properly compared with the proposed sequential loss.

    The authors propose to use least squares for the 3D reconstruction, which is not robust to outliers. What is the distribution of distances between the 3D point and the 3D lines? A bad detection could heavily bias the estimation. Why not use a method robust to outliers (least absolute deviation, M-estimation, …)?
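    To make this suggestion concrete: the least-squares point closest to a set of back-projected 3D lines has a closed form, and a Huber-weighted IRLS variant is a drop-in robust alternative. The sketch below is illustrative only, not the authors' implementation; the inlier scale `delta` is an arbitrary example value.

```python
import numpy as np

def triangulate_ls(points, dirs):
    """Least-squares 3D point closest to a set of lines; line i passes
    through points[i] with unit direction dirs[i]."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(points, dirs):
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the line
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

def triangulate_irls(points, dirs, delta=2.0, iters=20):
    """Robust variant: iteratively reweighted least squares with Huber
    weights, which down-weight lines far from the current estimate."""
    x = triangulate_ls(points, dirs)
    for _ in range(iters):
        A = np.zeros((3, 3))
        b = np.zeros(3)
        for p, d in zip(points, dirs):
            P = np.eye(3) - np.outer(d, d)
            r = np.linalg.norm(P @ (x - p))       # point-to-line distance
            w = 1.0 if r <= delta else delta / r  # Huber weight
            A += w * P
            b += w * (P @ p)
        x = np.linalg.solve(A, b)
    return x
```

    Least absolute deviation corresponds to weights w = 1/r; the Huber weights above behave like plain least squares for inliers and like LAD for outliers.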

    Page 4, end of the first paragraph (“Further, …”): why mention the fusion here and not in 2.4? This is misleading. I understand the detection is made by each individual view, but this sentence raises doubt as to whether the detection (2.2) is already multi-view.

    Sec 2.3: How many labels are considered in the identification problem? Same remark as before: the sentence on inference with the “majority number of labels” suggests that the identification is multi-view, but is this just the aggregation step?

    Sec 2.3: How many views are used as input? I think one, but it is not clear to me.

    Sec 2.3: How are transitional vertebrae handled? Is there a particular strategy to account for their very small sample size?

    Fig 1: L6 appears in the rows of “Sequence loss” but not in the columns, and it has no color in the blue area (Step 2). The figure could help complement the explanation of how this is handled.

    Could the authors better explain the rationale behind D = max(…) in Eq. 1? Why take the max over the previous, current, or next entry? This is a bit confusing.

    Sec 2.4: Is identification used to merge the detections? From the third sentence (“For a vertebra located at …”), one would understand that it is. But if there are errors in the identification, then faraway detections would be merged into wrong vertebra locations, hence the need for a robust method (see above). If identification is not used, is there an inlier threshold to decide which rays are merged together? This explanation should also be improved.

    End of Sec 2.4 (“Specifically, we identify …”): why use the last row and not the first? Is this choice arbitrary, or is there a motivation (ablation?)? What about the L6 cases?

    3.1: Is there a validation set for hyper-parameter tuning? What is its size? This should be explained here. If not, how are the hyper-parameters tuned?

    Table 1 should be extended with the results on VerSe20.

    The correct citation for [13] should be: Di Meng, Eslam Mohammed, Edmond Boyer, Sergi Pujades. Vertebrae localization, segmentation and identification using a graph optimization and an anatomic consistency cycle. In: C. Lian, X. Cao, I. Rekik, X. Xu, Z. Cui (eds.), Machine Learning in Medical Imaging (MLMI 2022), Lecture Notes in Computer Science, vol. 13583, pp. 307–317, Springer, 2022. ISBN 978-3-031-21013-6. DOI: 10.1007/978-3-031-21014-3_32

    https://link.springer.com/chapter/10.1007/978-3-031-21014-3_32

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would not accept the paper in its current form:

    • A discussion of how the sequential approach is different and novel with respect to [13] is missing. Page 2 line 1 is not correct: [13] is not a graph neural network; their identification method is global and does not “destroy” the global information.
    • A numerical comparison with [13] on VerSe20 should be provided.

    An appropriate rebuttal could answer these questions. If the VerSe20 results exceed the state of the art and the discussion is added, I would lean toward acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I am still hesitant about this paper.

    On the two stated contributions, namely identification on 2D images instead of 3D, and the sequence loss:

    A. The identification on 2D images instead of 3D is not novel. For instance, [18] appears in the paper's numerical comparison, but there is no discussion of how the presented approach differs from their strategy, which also performs detections in 2D and then merges them. This omission in the related work clearly misled reviewers.

    B. On the sequence loss, the rebuttal does not clarify how the proposed approach based on dynamic programming differs from the graph of [13], which is optimized with a shortest path. As shown in Fig. 1 and not discussed in the rebuttal, probabilities are aggregated over a path, as done in [13].

    In addition: C. Handling of pathological vertebrae is not discussed and is of major relevance. Most failure cases of methods on VerSe20 arise from label shifts due to transitional vertebrae. How this is incorporated in the dynamic programming step is not straightforward, and it is described neither in the paper nor in the rebuttal, nor illustrated in Fig. 1.

    D. Fusion of the detected vertebrae (2.4). The fusion of detected vertebrae was identified as a strength by reviewers, and it raised some comments/questions. The authors replied by stating that in one step of their work they “remov(e) the views with outlier predictions”. This information is added in the rebuttal but is not mentioned in the original paper. The rebuttal does not clearly describe how this removal is performed or how it was tuned.

    The benefits of the paper are:

    • the pre-training with contrastive learning
    • the very good accuracy, outperforming [18] and [13] (stated in rebuttal)

    In the rebuttal, the authors state that the code will be made available. This could help with reproducibility and with clarifying the details.

    For these reasons, I am still hesitant about whether the paper should be accepted.



Review #2

  • Please describe the contribution of the paper

    The contribution of the manuscript is a novel multi-view 2D approach for the identification and localization of vertebrae in 3D computed tomography images. The main novel aspects of the work are the multi-view voting scheme proposed for fusing predictions from multiple synthetic 2D views and the sequence loss. The sequence loss seeks to maximize the probability that vertebrae are labelled correctly by exploiting the monotonic sequence in which vertebrae appear, combined with per-vertebra weights based on the certainty of each label.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel approach that uses multi-view 2D analysis for localizing and identifying vertebrae in 3D.
      ◦ Previous work has primarily focused on a limited number of 2D views (coronal and sagittal) or on 3D assessments of spine imaging.
      ◦ By formulating the problem in 2D, the authors can limit the complexity of the architecture needed for each view's identification and localization, compared to 3D architectures.
      ◦ This could lead to faster processing or a decrease in the amount of training data needed.
    • The investigation is thorough in its comparison to previous work, showing improved performance.
      ◦ The investigation compares performance to other entries from the VerSe challenge.
    • The paper is well written, with excellent explanations of many parts of the method implemented.
    • The ablation experiments provide interesting insight into which aspects of the network are most useful.
      ◦ The multi-view voting appears to make the biggest difference in performance, yielding 9% and 6% improvements in the vertebra identification rate on the test and hidden sets.
      ◦ The contrastive learning pre-training has an effect, but it appears small relative to the other changes.
      ◦ The sequence loss appears to make a meaningful difference, but not as large as the multi-view voting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The rationale for the investigation is not well stated. There are many solutions to the VerSe challenge. Why did the authors implement this algorithm? Does the algorithm meet the clinical need they are attempting to address?
    • The methodological novelty is limited. Using 2D representations of 3D imaging has been done before, and so has using the sequence of vertebrae. The authors' exact formulation of the sequence loss is different; explaining why it is implemented differently, and what that reveals, would provide more motivation for this investigation.
    • The ablation could be more thorough, as not all combinations are investigated. The effect of pre-training seems small and that of voting seems big, but voting is not tested without pre-training. Does voting remove the need for pre-training?
    • The discussion of the results is extremely limited. How do the results agree or disagree with the literature?
    • The ablation descriptions could be clearer. How is multi-view voting ablated? What is the alternative strategy: a single 2D projection, or averaging without weighting the predictions?
    • How does the method determine the number of vertebrae in a scan? This is not clearly stated.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Good for reproducibility:
      ◦ open datasets used
      ◦ objective function and training are well described
      ◦ pre-processing and post-processing are well described
    • Limiting the ability to reproduce:
      ◦ code not provided
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Add results to the abstract; this will make the study much easier to understand from the abstract alone.
    • Add motivation, or clarify what the authors mean by “poor performance” in the context of a clinical need or of how this investigation advances methodology. The comparisons to the literature in the results do not appear consistent with the characterization of “poor performance”.
    • The authors could further strengthen the claim of superior performance by comparing their approach to the current leaderboard.
    • In the discussion and conclusion, expand the explanation of why the approach is superior. Do the results suggest contrastive learning would be helpful in 3D for this task?
    • The authors mention using ImageNet weights in the introduction but then do not seem to use them in the methods. Please explain.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The architecture is novel with some interesting aspects and some useful ablation results. The need for the investigation is a little unclear.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper describes a method to detect and identify vertebrae in CT scans. The method is based on a neural detector operating on 2D DRR projections, pre-trained with contrastive learning and trained with a new sequence loss, inspired by dynamic programming, that promotes consistent vertebra labels. The 2D detections are then back-projected into 3D space and clustered into the final detections. The authors evaluate the algorithm on the VerSe 2019 dataset and report better results than the current challenge leaders.
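    For context, a DRR (digitally reconstructed radiograph) of the kind used here can be approximated by a parallel projection of the CT volume. The sketch below is a generic illustration, not the paper's actual rendering procedure (which is not specified); the function name and parameters are assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def render_drr(volume, angle_deg):
    """Parallel-projection DRR: rotate a CT volume (z, y, x) about its
    longitudinal z-axis, then integrate intensities along one in-plane
    axis. A generic sketch of DRR rendering, not the paper's method."""
    rotated = rotate(volume, angle_deg, axes=(1, 2), reshape=False, order=1)
    return rotated.sum(axis=1)  # (z, x) projection image

# e.g., K evenly spaced views around the spine:
# views = [render_drr(ct, k * 360.0 / K) for k in range(K)]
```

    Rendering is then cheap at arbitrary angles, which is what makes a multi-view setup of this kind practical.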

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The two main strengths of the paper are the performance of the described algorithm (both the error distances and the identification rates are quite impressive; on the public dataset, the method outperforms the state of the art) and the originality of some of its components:

    • using DRRs instead of the 3D volumes allows the network to see the whole spine without huge memory requirements,
    • the novel sequence loss makes the identification network produce plausible and consistent detections along the whole spine
    • the pre-training with contrastive learning, which does seem to improve the results (although I would not necessarily have expected it)
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is the sometimes unclear description of some parts of the pipeline. While the reader is able to get the gist of it, I am not sure all the details required to fully reimplement the method are described. For instance:

    • what was the loss of the multi-view contrastive learning?
    • the sentence “we use a segmentation model to predict the vertebra labels around the detected vertebra centroids” is a bit vague: does the model output a 25-channel heatmap?
    • the sequence loss description could also be more thorough, in particular the rationale behind the parameters alpha and beta
    • In the sentence “we identify the index of the largest accumulated probability in the last row as the last vertebra category and utilize it as a reference to correct any consistencies in the prediction”, I do not understand how this allows errors to be corrected.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have used a public dataset (VerSe) for their evaluation.

    Since the pipeline is complex, it would really be beneficial to have the authors’ code in order to be able to reproduce the results. The reproducibility checklist filled out by the authors does state that the code will be released, but:

    • the submission/paper does not mention it;
    • some answers in the reproducibility checklist seem inaccurate (for instance, error bars, runtime, memory footprint, etc. were claimed to be reported but actually were not), so I am not sure how much this can be trusted.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    My main remarks on what could be improved in the description of the method have been mentioned in question 6. A couple of other questions/comments:

    • Since the network relies on a backbone pretrained on ImageNet, the input size is fixed. However, the extent of a spine CT can vary greatly depending on the field of view; this means that the size in pixels of a vertebra also varies a lot. Isn’t this a problem?
    • “It is intuitive but usually involves many segmentation artifacts” -> this sentence would deserve more justification
    • In the multi-view fusion part, a least-squares error is optimized to get the final position. Isn’t this particularly sensitive to outliers?
    • Tables 1 and 2: reporting only one number is not sufficient; adding other statistics such as standard deviations or quartiles would give a more complete picture.

    Typos:

    • “often lead to serve” -> “severe”?
    • “but the long term sequential information is well studied” -> is there something missing? I don’t understand the sentence
    • The last paragraph of section 3 is not an “ablation study” (since no parameter/component is removed), but just hyperparameter tuning
    • “while another works[14,18] localize” -> “other”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a very interesting paper with creative ideas, in particular the use of 2D images (DRR) instead of 3D volumes, and the sequencing loss to enforce consistency in the vertebrae identification. The accuracy of the method is also impressive and seems to outperform the state-of-the-art. I did not find any significant flaw.

    This is overall a really nice work, and there is not much to ask apart from confirming that the authors will release their source code and asking them to improve some parts of the method description.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    My main remarks on this paper were related to the description of the methods, which is hard to fully clarify in a rebuttal. I hope the authors will reformulate the unclear parts in the final version. Apart from that, the authors confirmed that they will release the source code. I am therefore keeping my recommendation towards acceptance.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
    • Strengths:
      ◦ effectiveness of the multi-view 2D approach
      ◦ clarity of exposition
      ◦ careful exploration of the hyper-parameter space

    The reviewers point out the following significant shortcomings that should be addressed in the rebuttal:

    • What is the relationship to, and the contribution with respect to, reference [13]?
    • Why was least-squares used for 3D reconstruction, as opposed to a technique that is robust to outliers?
    • A more thorough discussion of the methods, the hyper-parameter tuning/ablation studies, and the use of contrastive learning.
    • A comparison to the leaderboard, if it is possible without additional experiments.




Author Feedback

We thank all reviewers for their positive and thorough comments, which help us improve the paper’s quality. We summarize the comments into seven points, with our answers below.

Q1 (R1, Meta-R): Relationship to, and contribution with respect to, reference [13]. A: Work [13] is indeed a benchmark method, primarily operating on 3D patches and employing prior knowledge of spinal order in a graph optimization. In contrast, our approach offers two significant improvements over [13]. First, we execute our method on multi-view 2D images as opposed to 3D patches. This increases the receptive field, allowing the network to capture global information more efficiently. Second, we introduce a novel sequence loss based on dynamic programming, which allows the connection edges to be flexible and, importantly, enables supervision of the identification network during training, thus extending its effect beyond post-processing. Our evaluation results on VerSe20 also outperform [13]. (We will update this in the final paper.)

Q2 (R1, Meta-R): Why not validate on VerSe20? A: The VerSe20 challenge is primarily focused on segmentation and identification; its official leaderboard does not include localization, as VerSe19 does. Our main objective is to offer a fast and accurate solution for vertebra labeling, which is of significant importance in spine surgery. Therefore, we validated our method on VerSe19 and a large-scale in-house dataset (500 CT images; see supplementary), demonstrating not only excellent performance but also strong robustness. Additionally, as suggested by R1, we also evaluated the identification task of our method on VerSe20 and achieved leading performance on the leaderboard. We will include this information about VerSe20 in the final paper.

Q3 (R1, R3, Meta-R): The least-squares in localization fusion? A: We already remove the views with outlier prediction counts before the least-squares step, so the fusion encounters minimal outliers along the 3D lines. However, we acknowledge that there might be potential accuracy gains from adopting a more robust strategy, as mentioned by R1.

Q4 (R1, Meta-R): About hyper-parameter tuning. A: For VerSe19, we randomly split the 80 training scans 70/10 for training and hyper-parameter tuning. For our in-house dataset, we randomly split the scans 300/100/100 for training, hyper-parameter tuning, and testing.

Q5 (R2, R3, Meta-R): About the contrastive learning strategy. A: The contrastive learning is designed to further enhance the backbone’s capacity for extracting consistent anatomical information across varied 2D views. We compared it with a backbone pretrained solely on ImageNet and gained a 3% improvement in ID rate. More details of this part (e.g., the cosine-similarity loss) will be added in the final paper.
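As an illustration of this strategy, an InfoNCE-style objective over cosine similarities between embeddings of the same sample rendered from two different views might look like the sketch below. This is an assumption for illustration: the rebuttal only names a cosine-similarity loss, and the function name and temperature value are ours, not the authors’.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z_a, z_b, temperature=0.1):
    """Pull together embeddings of the same scan seen from two views and
    push apart embeddings of different scans (InfoNCE over cosine
    similarities). z_a, z_b: (batch, dim) embeddings from views A and B."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature     # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)  # positives lie on the diagonal
```

A pure cosine-similarity loss (maximizing only the diagonal terms) is an even simpler variant; the InfoNCE form above additionally uses the other samples in the batch as negatives.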

Q6 (R2): Add motivation and clinical relevance. A: Most research works have achieved promising results on the vertebra labeling task (about 4 mm distance error and 94-95% ID rate); however, more accurate localization is always needed, especially in applications such as spine surgery robots with computer-assisted navigation. Our motivation is to provide a fast and accurate solution by transforming the 3D labeling task into a 2D one. As a result, we achieved a distance error of about 2 mm and an ID rate of 96-98% on VerSe19 and a large-scale in-house dataset. The robustness and effectiveness of our method can better satisfy clinical demands and bring meaningful improvement to the field of spine surgery planning.

Q7 (R1, R3, Meta-R): Details about Eq. 1 and multi-view ID fusion. A: For Eq. 1 (R1), we aim to track an ascending sequence with maximum sequential information based on dynamic programming; we therefore employ the max function to derive the maximum accumulated probability. For multi-view ID fusion (R3), when not using weighted voting we adopt a bagging strategy: for each vertebra, we take the ID predicted by the majority of views.
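For concreteness, one possible reading of such a dynamic program, accumulating probabilities along a strictly ascending label sequence and backtracking from the best entry in the last row, is sketched below. This is an illustration of the idea only, not the exact Eq. 1; the strict-ascent constraint and all names are assumptions.

```python
import numpy as np

def best_label_path(probs):
    """probs: (n_vertebrae, n_labels) per-vertebra label probabilities.
    Returns the ascending label sequence with maximum accumulated
    probability (a sketch of the dynamic program described above)."""
    n, k = probs.shape
    acc = np.full((n, k), -np.inf)   # best accumulated probability
    acc[0] = probs[0]
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        for j in range(1, k):
            prev = int(np.argmax(acc[i - 1, :j]))  # best ascending predecessor
            acc[i, j] = acc[i - 1, prev] + probs[i, j]
            back[i, j] = prev
    # take the largest accumulated probability in the last row, then backtrack
    j = int(np.argmax(acc[-1]))
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i, j]
        path.append(j)
    return path[::-1]
```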

Other detailed descriptions and the released code link will be presented in the final paper.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed a good part of the concerns in the rebuttal. However, Reviewer 1 still provides a detailed list of concerns that would be very difficult to address in a rebuttal.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After integrating all the information, I have concerns about this paper’s technical novelty and experimental evaluation.

    1. Reviewer #1 gave detailed concerns on the clarity of the technical novelty, which I agree with. They are very valid concerns (listed as A/B/C/D).

    2. The authors use only one public dataset to evaluate their method, and that dataset may not be the state-of-the-art dataset. Important references are also missing.

    The SpineWeb dataset consists of 302 CT scans with vertebra center annotations. This dataset is commonly considered challenging and representative for this task, due to various pathologies and imaging conditions that include severe scoliosis, vertebral fractures, metal implants, and small fields of view (FOV).

    VertNet: Accurate Vertebra Localization and Identification Network from CT Images, MICCAI 2021

    Automatic Vertebra Localization and Identification in CT by Spine Rectification and Anatomically-constrained Optimization, CVPR 2021



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors had a successful rebuttal period. Most of the concerns were addressed.


