
Authors

Thanaporn Viriyasaranon, Serie Ma, Jang-Hwan Choi

Abstract

Accurate localization of anatomical landmarks has a critical role in clinical diagnosis, treatment planning, and research. Most existing deep learning methods for anatomical landmark localization rely on heatmap regression-based learning, which generates label representations as 2D Gaussian distributions centered at the labeled coordinates of each of the landmarks and integrates them into a single spatial resolution heatmap. However, the accuracy of this method is limited by the resolution of the heatmap, which restricts its ability to capture finer details. In this study, we introduce a multiresolution heatmap learning strategy that enables the network to capture semantic feature representations precisely using multiresolution heatmaps generated from the feature representations at each resolution independently, resulting in improved localization accuracy. Moreover, we propose a novel network architecture called hybrid transformer-CNN (HTC), which combines the strengths of both CNN and vision transformer models to improve the network’s ability to effectively extract both local and global representations. Extensive experiments demonstrated that our approach outperforms state-of-the-art deep learning-based anatomical landmark localization networks on the numerical XCAT 2D projection images and two public X-ray landmark detection benchmark datasets. Our code is available at https://github.com/seriee/Multiresolution-HTC.git.
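The heatmap-regression labeling described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' code: the heatmap size, the sigma value, and the example landmark coordinates are assumptions chosen for demonstration.

```python
import numpy as np

def gaussian_heatmap(h, w, cy, cx, sigma=2.0):
    """Render a 2D Gaussian label centered at the landmark (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# One heatmap channel per landmark, all at a single spatial resolution --
# the limitation the paper's multiresolution strategy aims to address.
landmarks = [(12, 20), (40, 33)]  # labeled (row, col) coordinates (made up)
heatmaps = np.stack([gaussian_heatmap(64, 64, y, x) for y, x in landmarks])
assert heatmaps.shape == (2, 64, 64)
```

Each channel peaks (value 1.0) exactly at its labeled coordinate, so localization accuracy is bounded by the heatmap's pixel grid.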

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_42

SharedIt: https://rdcu.be/dnwJX

Link to the code repository

https://github.com/seriee/Multiresolution-HTC.git

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    The main contribution of this research manuscript is combining CNN and vision transformer models to perform landmark detection by extracting local and global representations. A state-of-the-art study of deep localization frameworks is given and compared with the proposed model, which is evaluated on a 4D XCAT phantom CT dataset and two public X-ray datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors have conducted a comprehensive study of landmark detection on CT and X-ray image datasets and proposed a framework that extracts low- and high-level representations by combining the advantages of CNN and vision transformer models. The proposed framework reported improved landmark detection performance, achieving a lower mean radial error and a higher successful detection rate compared with the other models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This manuscript has limitations and lacks novelty in the proposed framework, which is necessary to be considered for publication. Similar publications related to hybrid transformer-CNN models already exist and have been applied in medical imaging and other domains, such as landmark detection and segmentation (see the references below). These indicate that the current study is an application of already existing methods. Since there is no CAI component in the paper, the MIC component requires methodological novelty.

    1. A. Yueyuan and W. Hong, “Swin Transformer Combined with Convolutional Encoder For Cephalometric Landmarks Detection,” 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 2021, pp. 184-187, doi: 10.1109/ICCWAMTIP53232.2021.9674147.
    2. Jun, Tae Joon, et al. “T-net: Nested encoder–decoder architecture for the main vessel segmentation in coronary angiography.” Neural Networks 128 (2020): 216-233.
    3. Yu, Xiang, Feng Zhou, and Manmohan Chandraker. “Deep deformation network for object landmark localization.” Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14. Springer International Publishing, 2016.
    4. Saleh, Alzayat, et al. “A lightweight Transformer-based model for fish landmark detection.” arXiv preprint arXiv:2209.05777 (2022).
    5. Lu, Zefeng, Ronghao Lin, and Haifeng Hu. “MART: Mask-Aware Reasoning Transformer for Vehicle Re-Identification.” IEEE Transactions on Intelligent Transportation Systems (2022).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    While datasets used in this study for experiments are publicly available, the hybrid transformer-CNN algorithm is not mathematically expressed or made available to the reader/reviewer for checking. This definitely reduces the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I have the following comments and suggestions: 1) Requesting the authors to explain in what way this manuscript is novel with reference to the paper below

    “A. Yueyuan and W. Hong, “Swin Transformer Combined with Convolutional Encoder For Cephalometric Landmarks Detection,” 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 2021, pp. 184-187, doi: 10.1109/ICCWAMTIP53232.2021.9674147.”

    2) The introduction needs more clarity on the objective and the list of contributions.
    3) In Methods, the architecture illustration involves MSE, but it is not involved in the comparative study. Is there any specific reason?
    4) I suggest the authors include the mathematical expression of the proposed model's algorithm in Methods.
    5) In Tables 1 and 2, instead of ‘ours’, try to include the proposed algorithm's name.
    6) Suggesting the authors explain the limited results with respect to 2 mm and 4 mm (Table 1) for the XCAT CT and ISBI2023 datasets, respectively.
    7) What are the limitations of this study?
    8) The conclusion is generic; you need to conclude in a concrete way with evidence from the results. Include future scope as well.
    9) The abstract needs to be rewritten concisely. Include the overall performance of the experimental results.
    10) Sentence structures in this manuscript are at times not comprehensible, which makes it difficult to understand. The paper requires thorough language editing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main drawback of this paper is that it lacks an explanation of the novelty of the method in comparison with existing published methods. The proposed methods that the authors claim as novel are similar to published papers and thus require proper justification. With these considerations I strongly reject this manuscript.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a hybrid transformer-CNN architecture (HTC) for the 2D landmark detection task. A key idea is to use the transformer architecture across scales in the image encoder, subsequently decode heatmaps at multiple scales, each with an MSE loss, and combine the multi-scale heatmaps to estimate the landmark coordinates. Experiments are conducted on 3 datasets, including 2 public X-ray landmark detection benchmark datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • While several landmark detection architectures have been proposed (with and without transformers), the proposed combination of transformer and CNN across scales in a DenseUNet-like architecture seems novel.
    • State-of-the-art performance is reported on the ISBI2023, hand, and XCAT CT datasets.
    • Ablation study is performed to show improvements attributed to the transformers in encoder and multi-scale/resolution heatmap.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The presented architecture simply averages the heatmaps at multiple resolutions. While the multiresolution design is intended to help increase precision, it is unclear whether averaging heatmaps across scales can also result in ambiguity, especially since there is no loss on the aggregate heatmap. The authors should consider including cases where landmark detection didn’t perform as expected and characterize the strengths and limitations of the architecture. Including results from using only one of the heatmaps vs. the combination would also help.
    • The proposed architecture processes the images at 5 pre-defined scales, even though the use of the transformer is expected to capture global information. It would be helpful to include a discussion, and possibly experiments, on the choice of scales and the reasons governing this choice. Will the performance drop with fewer scales? This would also be valuable for assessing the applicability of the proposed method in other settings. E.g., in memory-efficient settings, smaller models with fewer resolutions might be of interest, unless they drop performance. For very large images, will there be a need to add additional scales, or might transformers at the existing scales be sufficient?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have indicated that the code will be released upon acceptance. This, in conjunction with the observation that the model is evaluated on public benchmarks, would significantly help with reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This manuscript would benefit from the addition of a discussion on what aspects of the different benchmarks might have contributed to the improved performance of the proposed model. For instance, UNet has high performance on XCAT but the lowest on ISBI2023; however, the proposed method does well on both. Can this be attributed to certain aspects of the anatomy that the proposed method is able to capture, while others might be limited?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though a multitude of landmark detection models exist, the proposed method is potentially of interest, particularly due to the reported performance improvements. A few concerns regarding the heatmap aggregation, as well as the lack of discussion on the choice of architecture scales/resolutions, make it difficult to assess the potential impact of the approach in other MIC settings.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #5

  • Please describe the contribution of the paper

    The paper proposes a hybrid transformer-CNN based multiresolution learning method for anatomical landmark detection. Experiments on several datasets show the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Please see the detailed comments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Please see the detailed comments.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    n/a

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The main contributions of the paper can be summarized at the end of the introduction.
    2. Only two resolutions are employed in the proposed method, yet the authors claim it is multiresolution. Can more resolutions lead to better performance?
    3. Transformer-only and CNN-only variants should be compared in the experiments, not only the hybrid transformer-CNN model.
    4. The existing methods compared in Tables 1 and 2 are different. Moreover, classifying these compared methods as transformer or CNN models would be helpful for understanding.
    5. It is suggested to improve the presentation of this paper. For example, the figures are of low quality.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the comments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors seem not to have addressed my comments.



Review #6

  • Please describe the contribution of the paper

    In this study, the authors present a new feature extraction architecture, referred to as the hybrid transformer-CNN (HTC), along with multiresolution heatmap learning for automatic anatomical landmark detection. The HTC architecture comprises multiple stages of stacked transformer modules, which incorporate a bilinear pooling attention module to capture the global information of images, and convolutional modules to extract local and specific feature representations relevant to landmarks. Additionally, the authors introduce multiresolution heatmap learning to improve the network’s ability to capture global and local representations more accurately than learning from a single heatmap resolution, thereby enhancing network localization.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In the reviewer’s opinion, the main strengths are two:
    • The feature extraction architecture, referred to as the hybrid transformer-CNN (HTC), along with multiresolution heatmap learning for automatic anatomical landmark detection
    • The results obtained, outperforming those of the state of the art

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -More details could be given for the HTC structure and its stages. It seems that many parameters were empirically chosen (number of layers, sizes) and the selection of its parameters could have been better explained.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reviewer is satisfied with the information provided by the authors.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    -More details could be given for the HTC structure and its stages. It seems that many parameters were empirically chosen (number of layers, sizes) and the selection of its parameters could have been better explained.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In the reviewer’s opinion, the main strengths are two:
    • The feature extraction architecture, referred to as the hybrid transformer-CNN (HTC), along with multiresolution heatmap learning for automatic anatomical landmark detection
    • The results obtained, outperforming those of the state of the art.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work is on anatomical landmark localization using a hybrid CNN and transformer based network design. It provides an evaluation of the proposed method on several datasets, which are publicly available, and performs an extensive comparison with other methods as well as some ablation studies, demonstrating a performance improvement.

    The work is well written and easy to follow. The evaluation is thorough and uses various datasets and several competing methods to demonstrate the improvement in performance. The results compared with state-of-the-art methods look very promising. If the authors publish their code, reproducibility will be possible.

    From the reviews and my assessment, a number of shortcomings have to be mentioned, which need discussion/clarification in the rebuttal:

    • There is a strong argument regarding the novelty of the proposed methodology, claiming that the proposed architecture and idea of combining CNNs and transformers is very similar to several published papers, and especially similar to the work “Swin Transformer Combined with Convolutional Encoder For Cephalometric Landmarks Detection”. Authors need to clarify this and give a response on how the proposed architecture is a novel contribution to the literature.
    • It is recommended to more clearly state the contributions towards the end of the introduction, such that these become clear after discussing the related work.
    • The authors propose a multi-resolution approach for landmark localization; however, as reviewers mention, there seem to be several sources of multi-resolution hidden within the architecture. Transformers operate on global context, so it is not clear why they are used in all stages of the 5-stage UNet design. The UNet, via downsampling of feature maps, also implicitly performs multi-resolution processing, so it is not clear how redundant the individual, per-stage transformer modules are. From the ablations this cannot be seen easily either. Finally, there is the averaging of final predictions at two different scales, which also introduces multi-resolution processing. Again, it is not clear why this is done (the finest-detail heatmap prediction, trained via the loss, might be worsened by averaging with the coarser-level heatmap). It seems a number of ablations are still missing to justify the final network design.
    • Reviewers comment that the details of the HTC architecture are not given in enough detail. Supplementary material or a code release might be required for reproducibility.
    • Regarding the comparison with other methods, it is not clear how fair this comparison is in terms of network parameters and compute/memory considerations. Adding the transformer stages into the UNet hierarchy (effectively replacing one convolution layer per feature scale with a transformer) will undoubtedly increase the model parameters significantly. To better judge the comparison to state-of-the-art methods, it has to be clear by which order of magnitude the number of trainable parameters changes, and which effects this has on training time, inference time, and memory consumption. While I would not consider higher compute/memory requirements a drawback if the downstream application benefits from them, the cost of the performance improvement (if there is any) still has to be transparent.

    Please consider these remarks as well as further reviewer comments for your rebuttal.




Author Feedback

We appreciate the constructive comments from all Reviewers and the Meta-Reviewer.

(Meta-review 1, R3) The HTC differentiates itself from prior approaches by integrating a distinctive architectural design that merges CNNs and transformers. Unlike earlier methods, which either substitute the convolutional operation with a transformer-based module (as in ‘CephalFormer: Incorporating Global Structure Constraint into Visual Features for General Cephalometric Landmark Detection’) or carry out feature fusion between the output features of each stage from the CNN and transformer backbone (as seen in ‘Swin Transformer Combined with Convolutional Encoder For Cephalometric Landmarks Detection’), our HTC introduces a stack of convolutional and transformer modules, applied to all stages of the encoder. 1) At each stage, the transformer module is tasked with extracting global information, while the convolutional module isolates local information. 2) We propose a lightweight, positional-encoding-free transformer module. Instead of utilizing multi-head attention, we introduce bilinear pooling operation to capture second-order feature statistics and generate global representations. 3) Standard transformer encoders struggle with the fixed resolution of positional encoding, leading to decreased accuracy when interpolating the position encoding during testing with resolutions different from the training data. To mitigate this issue, we omit the position encoding from the transformer modules and utilize a 3x3 convolutional operation as the patch embedding to grasp location information and generate low-resolution fine features for the hierarchical encoder architecture design. Additionally, we integrate the HTC with the proposed multi-resolution heatmap learning.
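The "second-order feature statistics" captured by bilinear pooling, as mentioned in the rebuttal, can be illustrated with a rough sketch. This is not the authors' implementation: the function name, feature shapes, and the plain outer-product formulation are assumptions used only to show the general idea of pooling pairwise channel interactions over all spatial positions.

```python
import numpy as np

def bilinear_pool(feats):
    """Second-order (outer-product) pooling of a feature map.

    feats: (C, H, W) feature map.
    Returns a (C, C) matrix of channel-pairwise statistics averaged over
    all spatial positions -- a global, position-free summary of the map.
    """
    c, h, w = feats.shape
    x = feats.reshape(c, h * w)          # flatten spatial dimensions
    return x @ x.T / (h * w)             # average outer product over positions

feats = np.random.rand(8, 16, 16)
g = bilinear_pool(feats)
assert g.shape == (8, 8)
```

Because the spatial positions are averaged out, such a summary needs no positional encoding, which is consistent with the rebuttal's motivation for omitting it; the actual attention module in the paper is presumably more elaborate.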

(Meta-review 2 and 4, R5, R6) We have incorporated a synopsis of the main contribution at the conclusion of the introduction in our manuscript and have included additional information about the detailed operation of the proposed HTC encoder. Moreover, we have made the code available on GitHub (https://github.com/seriee/Multiresolution-Learning-based-Hybrid-Transformer-CNN-Model-for-Anatomical-Landmark-Detection) to guarantee reproducibility.

(Meta-review3, R4) The proposed HTC is crafted to generate a multi-level feature representation with both high- and low-resolution features. We introduce the transformer module to all four stages of the encoder to amplify the global feature extraction ability for both high- and low-resolution features at each stage of the hierarchical encoder architecture. Generally, global information such as the geometric relation between landmarks is contained in the high-resolution coarse feature, while local information such as area information for each landmark is present in the low-resolution features. To predict the landmarks’ location accurately, it is vital to incorporate both global and local information. Therefore, we compel the detector to learn both global and local information from the heatmap generated by the high-resolution coarse feature and the low-resolution fine-grained features, through the weighted summation of the heatmap loss from each resolution heatmap. Furthermore, the landmark coordinates derived from a high-resolution heatmap exhibit high bias and low variance, whereas the landmark coordinates obtained from a low-resolution heatmap demonstrate low bias and high variance. Hence, we average the predicted coordinates from the high- and low-resolution heatmaps to balance the bias and variance of the predicted landmarks at inference.
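The inference-time fusion described above (averaging the coordinates decoded from the two heatmap resolutions) might be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the plain per-channel argmax decoding, and the heatmap sizes are assumptions.

```python
import numpy as np

def heatmap_to_coords(hm):
    """Decode the argmax (row, col) of each landmark channel in a (K, H, W) heatmap."""
    k, h, w = hm.shape
    flat = hm.reshape(k, -1).argmax(axis=1)
    return np.stack([flat // w, flat % w], axis=1).astype(float)

def fuse_predictions(hm_fine, hm_coarse):
    """Average coordinates decoded from a fine and a coarse heatmap.

    Coarse coordinates are rescaled into the fine heatmap's grid before
    averaging (scale factor inferred from the spatial sizes), balancing the
    bias/variance trade-off the rebuttal describes.
    """
    scale = hm_fine.shape[1] / hm_coarse.shape[1]
    coords_fine = heatmap_to_coords(hm_fine)
    coords_coarse = heatmap_to_coords(hm_coarse) * scale
    return (coords_fine + coords_coarse) / 2.0
```

For example, with a single landmark peaking at (4, 4) in an 8x8 heatmap and at (2, 2) in a 4x4 heatmap, both decodings agree and the fused prediction is (4.0, 4.0) in the fine grid.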

(Meta-review 5, R3) With the lightweight transformer architecture design, the number of parameters in the proposed method (16.2 M) is significantly lower than that of UNet (35.23 M). To demonstrate a fair comparison with state-of-the-art methods, we have included the number of parameters for the proposed method relative to other methods in the manuscript.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    With their rebuttal, authors have addressed the main concerns raised in the reviews. Although the difference to related work is not large, the stacked convolutional and transformer based network parts still show improvements in the experimental evaluation, which is comprehensive. The fact that the code will be published is excellent for reproducibility reasons. The concern that the number of additional parameters makes the comparison unfair to some degree could be clarified by the authors, in fact, it seems the number of parameters is lower than expected. Overall, this work might be a small step further in improving the state of the art in landmark localization, therefore I tend to vote for acceptance of this work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    As pointed out by the reviewers, this study has some strengths in performance, but also some weaknesses, e.g., in terms of novelty or ablation experiments. Unfortunately, at this time, the study doesn’t reach the bar for acceptance in my opinion; however, I invite the authors to expand their study with the suggestions of the reviewers and also take it to the next level in terms of using these landmarks in some clinical application (e.g., addressing the question of how these landmarks serve a clinician, and whether it matters if a landmark is 1 mm off).



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the authors provided their rebuttal, none of the reviewers changed their original scores. However, the final score is still among the ones on the higher-side in my pool.


