Authors

Yankai Jiang, Yiming Li, Xinyue Wang, Yubo Tao, Jun Lin, Hai Lin

Abstract

Accurate cephalometric landmark detection is a crucial step in orthodontic diagnosis and therapy planning. However, existing deep learning-based methods lack the ability to explicitly model the complex dependencies among visual features and landmarks. Therefore, they fail to adaptively encode the landmark’s global structure constraint into the representation of visual concepts and suffer from large biases in landmark localization. In this work, we propose CephalFormer, which exploits the correlations between visual concepts and landmarks to provide meaningful guidance for accurate 2D and 3D cephalometric landmark detection. CephalFormer explores local-global anatomical contents in a coarse-to-fine fashion and consists of two stages: (1) a new efficient Transformer-based architecture for coarse landmark localization; (2) a novel paradigm based on self-attention to represent visual clues and landmarks in one coherent feature space for fine-scale landmark detection. We evaluated CephalFormer on two public cephalometric landmark detection benchmarks and a real-patient dataset consisting of 150 skull CBCT volumes. Experiments show that CephalFormer significantly outperforms the state-of-the-art methods, demonstrating its generalization capability and stability to naturally handle both 2D and 3D scenarios under a unified framework.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_22

SharedIt: https://rdcu.be/cVRs8

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a two-stage framework for 2D and 3D landmark detection. It first detects coarse landmarks with a U-Net in which CephalFormer blocks are inserted, and then refines the landmarks with a sequence of CephalFormer blocks, where the Transformer explicitly takes the global structure constraint into account. The method outperforms the state-of-the-art methods on two public cephalometric landmark detection datasets and a real-patient dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method is carefully designed and significantly outperforms the state-of-the-art methods.

    • The use of local group attention and global group reduction attention is an efficient way to capture long-range dependencies while alleviating the computational burden.
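
The efficiency argument can be made concrete by counting query-key score computations. The sketch below is only an illustration: this page does not reproduce the paper's definitions, so it assumes that local group attention restricts each token to a fixed-size group and that global group reduction attention lets every token attend to a key set shrunk by a fixed factor. The names `group_size` and `reduction` and the example sizes are hypothetical, not values from the paper.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Every token attends to every token."""
    return n_tokens * n_tokens


def local_group_pairs(n_tokens: int, group_size: int) -> int:
    """Tokens attend only within their own group (assumed LGA-like layout)."""
    full_groups, remainder = divmod(n_tokens, group_size)
    return full_groups * group_size * group_size + remainder * remainder


def reduced_global_pairs(n_tokens: int, reduction: int) -> int:
    """Every token attends to a key set shrunk by `reduction` (assumed GGRA-like layout)."""
    n_keys = -(-n_tokens // reduction)  # ceiling division
    return n_tokens * n_keys


if __name__ == "__main__":
    n = 32 * 32 * 32  # tokens in a hypothetical 32^3 feature volume
    print("full:   ", full_attention_pairs(n))
    print("local:  ", local_group_pairs(n, group_size=512))
    print("reduced:", reduced_global_pairs(n, reduction=64))
```

Under these assumptions both restricted layouts grow linearly in the number of tokens for fixed `group_size` and `reduction`, whereas full attention grows quadratically.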

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The landmark embedding is confusing. What is its definition? How is it calculated in this paper? Is it a coarse visual feature from the coarse predicted heatmap?

    • The statement about the global structure constraints is too strong. It is hard to see the technical novelty compared with what was proposed in [p1, p2].

    • Some method details are not clear in the Fine-Scale Coordinate Refinement section. For example, what is C of patches P? What is R^{g}? What is the label embedding? It is suggested to define each notation before using it.

    [p1] Structure-aware long short-term memory network for 3d cephalometric landmark detection (TMI 2022). [p2] Structured landmark detection via topology-adapting deep graph learning (ECCV 2022).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is promised to be public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The landmark embedding should be well defined and evaluated. Is it a coarse visual feature from the coarse predicted heatmap?

    • The statement about the global structure constraints should be toned down.

    • More details are needed in the section on Fine-Scale Coordinate Refinement.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the impressive performance and the careful network design, I tend to accept the submission before the rebuttal. However, some details need to be elaborated. The authors claim that they are the first to explicitly model the dependencies between all visual features and anatomical landmarks, but the landmark embedding is not well defined or evaluated. What is its definition? How is it calculated? Moreover, what is the insight behind it, and how does performance change without the landmark embedding? In addition, the way global structure constraints are modeled is very similar to [p1, p2]. Given the unclear definition and the lack of evaluation of the landmark embedding, it is doubtful that the current presentation supports such a strong statement. The final score will depend on the response.

    [p1] Structure-aware long short-term memory network for 3d cephalometric landmark detection (TMI 2022). [p2] Structured landmark detection via topology-adapting deep graph learning (ECCV 2022).

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript proposed a two-stage method with transformer-based neural networks for 2D/3D cephalometric landmark detection. The proposed method utilizes a transformer-based network for coarse predictions of landmarks, and uses self-attention layers to refine landmark prediction combining information of high-resolution image appearance and low-resolution predictions. Moreover, the experimental results validated model performance of the proposed method on three different data sets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The submitted contents are related to the application of neural networks in the field of anatomical landmark detection in 2D/3D medical images, which is highly relevant to the MICCAI audience.

    • Experimental results support the claims made in the paper.

    • The paper is well-organized and well-written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The technique in the manuscript might not be sound. The two-stage method directly predicts the coordinates from the models, which might not generalize when the models are applied to images with a different (e.g., larger or smaller) field of view, especially for 3D CT. In addition, the second-stage model depends heavily on the predictions of the first-stage model. If the first model misses a landmark, it is very difficult for the second one to locate it.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the implementation is not easy to achieve.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Are the models in the proposed method trained end-to-end? Or are they trained sequentially?
    2. Why not predict the coordinates of the landmarks from the first-stage model? Or predict the heatmap from the second-stage model?
    3. If the first model misses the landmarks in the predictions, is it possible for the second-stage model to get them back in the final predictions?
    4. In the experiments with 3D CT, are all volumes resampled to the same size or to the same resolution/spacing? If they are resampled to the same spacing, how does the model handle an image shape that differs from the model input shape?
    5. In general, will more CephalFormer blocks help for better model performance?
    6. What is the ground truth for heatmap prediction? Why is cross-entropy loss used here, given that it is typically used for classification tasks in the literature? The regression formulation in the manuscript is unclear.
    7. Comparing GPU memory footprint (which is critical in 3D medical image analysis) with other state-of-the-art methods will help understand model efficiency.
    8. What does a typical failure case in the prediction look like? What is the cause of the failure?
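
The distinction raised in question 4 can be illustrated with a little arithmetic. The sketch below is not taken from the paper; the helper names and the example volume are hypothetical. It shows that resampling to a common spacing leaves volumes with different grid shapes, while resizing to a common shape leaves them with different (possibly anisotropic) voxel spacings.

```python
def shape_after_resampling(shape, spacing, target_spacing):
    """Voxel grid size after resampling to target_spacing, keeping the physical extent."""
    return tuple(
        max(1, round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing)
    )


def spacing_after_resizing(shape, spacing, target_shape):
    """Voxel spacing after resizing to target_shape, keeping the physical extent."""
    return tuple(n * s / m for n, s, m in zip(shape, spacing, target_shape))


if __name__ == "__main__":
    # Hypothetical CBCT volume: 100 x 80 x 60 voxels at 0.5 mm isotropic spacing.
    shape, spacing = (100, 80, 60), (0.5, 0.5, 0.5)
    print(shape_after_resampling(shape, spacing, (1.0, 1.0, 1.0)))
    print(spacing_after_resizing(shape, spacing, (50, 40, 20)))
```

In the second case the physical size of a voxel changes per axis, so any coordinates predicted on the resized grid have to be scaled back by the per-axis spacing before distances in millimetres are computed.
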
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript presented a new neural network based method for anatomical landmark detection. But its technique is not sound. More explanation and description are required for further improvement.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    To adaptively encode the landmark’s global structure constraint into the representation of visual concepts and avoid large biases in landmark localization, this paper proposes CephalFormer, which exploits the correlations between visual concepts and landmarks to provide meaningful guidance for accurate 2D and 3D cephalometric landmark detection. Experiments on two public datasets show that CephalFormer significantly outperforms the state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The authors proposed a general Transformer-based framework that naturally handles both 2D and 3D scenarios for landmark detection. (2) The authors proposed an innovative way to represent visual features and landmarks in a coherent feature space to explicitly incorporate the global structure constraint for accurate cephalometric landmark detection. (3) The method in this paper outperforms the state-of-the-art methods on two public cephalometric landmark detection benchmarks and a real-patient dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No obvious weaknesses found for this paper.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code was made public in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Apart from landmark detection, can the LGA and GGRA be used for other purposes, for example ordinary semantic segmentation? I think the authors should discuss this topic.
    2. Figure 3 is unclear when comparing the details against other methods. The authors could split it into two figures and zoom in on the landmark areas so that readers can clearly see how CephalFormer improves the results.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has quite strong innovations and is also quite easy for readers to follow. A very nice paper that should be accepted.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strength

    • The method explores local-global anatomical constraints for general landmark detection

    Weakness

    • The motivation and the methodological presentation could be improved
    • Qualitative and quantitative comparison of the results obtained from the first stage with those from the second stage will be helpful to demonstrate the effectiveness of the proposed method.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

  1. About the landmark embeddings. (Review 1, Question 1): The landmark embeddings represent the t possible landmarks in each input. They are learned by a linear embedding layer that takes features extracted from the coarse predicted heatmaps as inputs. The evaluation of the importance of the learned landmark embeddings is presented in Table 3 in the supplementary material. We observed that performance drops significantly in both 2D and 3D scenarios without the landmark embeddings.
  2. CephalFormer differs from previous methods in how it models global structural constraints. (Review 1, Question 2): Previous methods [p1, p2] use GNNs or additional modules to model landmark-to-landmark dependencies, but they cannot represent input features and landmarks in one coherent graph. A key aspect of CephalFormer is that the second-stage encoder can be viewed as a fully connected graph, which is capable of learning any relationship between visual features and landmarks (visual feature-to-landmark and landmark-to-visual feature dependencies).
  3. More explanations about the method.
     (Review 2, Question 1): The models in our method are trained end-to-end.
     (Review 2, Question 2): The first-stage model predicts heatmaps because direct coordinate regression requires the network to learn highly nonlinear functions, which is difficult for the coarse stage, while heatmap regression provides spatially richer supervision.
     (Review 2, Question 3): The second-stage model takes high-resolution patches as inputs and predicts the coordinates. If the first model misses landmarks in its predictions, the second-stage model can still recover them in the final predictions because it has direct access to the input images. In fact, the second-stage model can predict the coordinates even without the results of the first-stage model.
     (Review 2, Question 4): In the experiments with 3D CT, all volumes are resampled to the same size. If the image shape differs from the model input shape, the image is resized to a specific shape.
     (Review 2, Question 5): In general, more CephalFormer blocks (more LGA and GGRA modules) do contribute to better model performance (see Table 5 in the supplementary material).
     (Review 2, Question 6): We model each ground truth landmark as a channel heatmap with a 2D Gaussian distribution centered at the landmark. Within the circular area of a channel, the pixel values indicate the probability that the landmark appears there. Please refer to [A1] for more details on how to model each ground truth landmark as a channel heatmap. Cross-entropy loss is also frequently used in landmark heatmap regression tasks [A1], and it is commonly used to quantify the difference between two probability distributions. In our regression formulation, we convert each ground truth landmark into a heatmap following a Gaussian distribution, and then minimize the cross-entropy loss to make the predicted landmark heatmap distribution as close as possible to the ground truth landmark heatmap distribution.
     (Review 2, Question 8): The typical failure case in the prediction is mainly caused by incomplete input images, e.g., images in which many landmarks are missing. In such situations, the improvement brought by explicitly modeling global structure constraints is not obvious. Semi-supervised or self-supervised approaches may help handle such extreme cases, and this will be our future work.
  4. We thank all the reviewers for their suggestions, which gave us a lot of inspiration on how to further improve our work.
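
The "fully connected graph" view in response 2 can be sketched as plain self-attention over one sequence that concatenates visual tokens and landmark embeddings. This is a minimal illustration under stated assumptions, not the authors' encoder: it uses a single head, identity query/key/value projections, and toy 2-dimensional tokens, and the function name `joint_self_attention` is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(visual_tokens, landmark_tokens):
    """Single-head self-attention over the concatenated token sequence.

    Query/key/value projections are the identity here (an assumption made to keep
    the sketch short); a real encoder would learn them.
    Returns (attention_weights, updated_tokens).
    """
    tokens = visual_tokens + landmark_tokens  # one shared sequence of N + t tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = [
        softmax([scale * sum(q * k for q, k in zip(query, key)) for key in tokens])
        for query in tokens
    ]
    updated = [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, updated


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 toy visual tokens
    landmarks = [[0.5, 0.5], [1.0, -1.0]]          # t = 2 toy landmark embeddings
    weights, _ = joint_self_attention(visual, landmarks)
    for row in weights:
        print([round(w, 3) for w in row])
```

Each row of the (N + t) × (N + t) weight matrix is one node's edges to every other node, so visual-to-landmark, landmark-to-visual and landmark-to-landmark dependencies all live in the same matrix.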

[A1] An attention-guided deep regression model for landmark detection in cephalograms. (MICCAI 2019)
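
The heatmap formulation described in the response to Review 2, Question 6 can be sketched as follows. This is an illustrative reading only: the grid size, `sigma`, and the choice to normalize both heatmaps to sum to 1 before taking the cross-entropy are assumptions, and the exact loss variant used in the paper or in [A1] (for example a per-pixel binary cross-entropy) may differ.

```python
import math


def gaussian_heatmap(height, width, center, sigma):
    """Ground-truth channel: unnormalized 2D Gaussian peaking at 1.0 on the landmark."""
    cy, cx = center
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def normalize(heatmap):
    """Rescale a non-negative heatmap so that its pixels sum to 1."""
    total = sum(sum(row) for row in heatmap)
    return [[v / total for v in row] for row in heatmap]


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum(target * log(predicted)) over all pixels."""
    return -sum(
        t * math.log(p + eps)
        for t_row, p_row in zip(target, predicted)
        for t, p in zip(t_row, p_row)
    )


if __name__ == "__main__":
    target = normalize(gaussian_heatmap(16, 16, center=(8, 8), sigma=2.0))
    close = normalize(gaussian_heatmap(16, 16, center=(8, 9), sigma=2.0))
    far = normalize(gaussian_heatmap(16, 16, center=(3, 12), sigma=2.0))
    print("loss, prediction one pixel off:", cross_entropy(target, close))
    print("loss, prediction far away:     ", cross_entropy(target, far))
```

Because the cross-entropy between two distributions is smallest when they coincide, minimizing it pulls the predicted heatmap's mass toward the ground-truth landmark, which is why it can serve as a regression objective here.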


