
Authors

Haibo Jin, Haoxuan Che, Hao Chen

Abstract

Recently, anatomical landmark detection has achieved great progress on single-domain data, where training and test sets are usually assumed to come from the same domain. However, such an assumption is not always true in practice, which can cause a significant performance drop due to domain shift. To tackle this problem, we propose a novel framework for anatomical landmark detection under the setting of unsupervised domain adaptation (UDA), which aims to transfer the knowledge from a labeled source domain to an unlabeled target domain. The framework leverages self-training and domain adversarial learning to address the domain gap during adaptation. Specifically, a self-training strategy is proposed to select reliable landmark-level pseudo-labels of target domain data with dynamic thresholds, which makes the adaptation more effective. Furthermore, a domain adversarial learning module is designed to handle the unaligned data distributions of the two domains by learning domain-invariant features via adversarial training. Our experiments on cephalometric and lung landmark detection show the effectiveness of the method, which reduces the domain gap by a large margin and outperforms other UDA methods consistently.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_66

SharedIt: https://rdcu.be/dnwdN

Link to the code repository

https://github.com/jhb86253817/UDA_Med_Landmark

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper addresses unsupervised domain adaptation for cephalometric landmark detection. The authors make three contributions to solve the problem: 1) a novel landmark detection model to jointly perform coordinate regression and obtain confidence estimates, 2) a self-training scheme, which performs confidence-based pseudo label selection with landmark-specific dynamic thresholds, 3) adversarial feature alignment. Evaluation is performed on two public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This is the first work on domain adaptive landmark detection. This is a very relevant task and the authors provide a benchmark to the community, which is of interest for future works.
    • The paper is clearly written and easy to follow in all parts. The authors provide almost all necessary information to understand the method and evaluation. Moreover, both the problem itself and the proposed methods are well motivated.
    • Adequate comparison with four recent SOTA UDA methods, demonstrating a strong performance of the proposed method. Ablation experiment verifies the contribution of the proposed adaptation strategies.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Human/animal pose estimation can also be considered a landmark detection task, and there are diverse UDA methods in the literature, e.g. [a,b,c], that are not discussed. In particular, the consistency-constrained self-training in [a] appears very similar to the authors’ self-training strategy. Given that adversarial domain alignment is adopted from [10], the technical novelty of the proposed DA methods is rather limited.
    • The proposed self-training strategy (LAST) alone (Tab. 2) is inferior to the competing AT method [21] (Tab. 1), which is also a self-training method. The SOTA performance by the authors’ method is thus only due to the combination with domain alignment (DAL), which, in turn, questions the choice of LAST. Wouldn’t it be better to combine AT with DAL?
    • The authors argue that coordinate regression methods cannot output confidence estimates, motivating the design of a novel model. However, confidence can be determined with diverse methods that are also applicable to regression models, e.g. Monte-Carlo dropout, predictive variance under input perturbations or between different network heads. In this light, the design of the novel model appears unnecessary or should rather be motivated by an ablation experiment, demonstrating the superiority of the proposed confidence estimates over the above-mentioned approaches. But since I don’t consider the model itself as the major contribution, this is not decisive as long as all comparison methods employ the same model.

    [a] J. Mu et al.: “Learning from synthetic animals”. CVPR. 2020. [b] W. Yang et al.: “3d human pose estimation in the wild by adversarial learning”. CVPR. 2018. [c] A. Bigalke et al.: “Domain adaptation through anatomical constraints for 3d human pose estimation under the cover.” MIDL. 2022.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used public datasets and state to release their code. Thus, results should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The SOTA competitors are mainly self-training methods (+FDA). It would be interesting to compare to different strategies such as image-to-image translation (CycleGAN) or adversarial output space adaptation [b].

    The following technical and implementation details should be included in the camera-ready:

    • How exactly is the confidence inferred from the score maps?
    • Is the model randomly initialized after each round of self-training?
    • How are the batches composed? Half source, half target?
    • The target dataset stems from 7 different devices, indicating potential domain gaps within the target domain. Potential consequences for the evaluation should be discussed.
    • Are all comparison methods implemented based on the same detection architecture?

    Typos:

    • “There are also other works explored …” –> Is a word missing?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the technical novelty of the method is partially limited and some related work still needs to be discussed, the paper provides a strong benchmark for the previously unexplored task of domain adaptive landmark detection, which is of high relevance to the community in my view.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors clarified most of my concerns and included several convincing ablation experiments to address concerns by other reviewers. Even though the technical novelty of the paper remains limited, benchmarking a novel problem (UDA for landmark detection) and strong performance in a comprehensive evaluation make the paper an interesting contribution to MICCAI.

    I strongly recommend that the authors discuss the differences between their work and the mentioned related works in the final version, as promised. I also recommend discussing potential effects of domain shifts within the target data.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a framework for anatomical landmark detection under the setting of unsupervised domain adaptation. This framework aims to transfer the knowledge from a labeled source domain to an unlabeled target domain, which can help address the problem of domain shift. The framework includes a base landmark detection model, a landmark-aware self-training (LAST) strategy, and a domain adversarial learning (DAL) module. The proposed method was evaluated on cephalometric landmark detection and showed a significant reduction in mean radial error and an improvement in the success detection rate.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In my opinion, the global-local scheme of the landmark detection model and the domain adversarial learning are common practice. The key contribution is the landmark-aware self-training strategy (LAST). LAST is landmark-aware and selects reliable landmark-level pseudo-labels of target domain data with dynamic thresholds, which can help reduce the impact of confirmation bias and improve the adaptation effectiveness.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper does not provide a detailed analysis of the computational complexity and model parameters. The landmark detection model contains a convolutional network and a transformer network, which may consume too many resources.
    • The ablation study does not cover the key contribution, i.e., the confidence mask in LAST.
    • The proposed method was only evaluated on cephalometric landmark detection, and it is unclear how well it would perform on other medical imaging modalities and tasks or in other domains.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The image datasets are publicly available.
    • Parameter details are specified but no link to code is provided.
    • Section 2 and Section 3.1 are informative enough to facilitate reproduction.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors may want to conduct more ablation studies on: 1) w/, w/o confidence mask; 2) w/, w/o dynamic thresholds; 3) self-training round t=1, 2, 3, 4, 5 (same as [3])
    • The computational complexity and model parameters should be reported.
    • The authors may want to conduct experiments on other anatomical regions and modalities.

    [3] Cascante-Bonilla, P., Tan, F., Qi, Y., Ordonez, V.: Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In: AAAI (2021).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method contains three parts: 1) global-local landmark detection model; 2) domain adversarial learning; 3) landmark-aware self-training strategy. 1) and 2) are not hard to find in the related literature, while 3) is the key contribution. In 3), it seems that the dynamic thresholds are a kind of curriculum learning similar to [1]. However, this paper does not carry out experiments on the confidence mask and dynamic thresholds in different self-training rounds, while [3] does.

    Nevertheless, this paper is well-organized and well-written, and it makes the first attempt at UDA for anatomical landmark detection. Moreover, the performance is satisfactory.

    [3] Cascante-Bonilla, P., Tan, F., Qi, Y., Ordonez, V.: Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In: AAAI (2021).

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The adequate experiments mitigate my concerns.



Review #4

  • Please describe the contribution of the paper

    This work aims to mitigate the performance drop in landmark detection under the unsupervised domain adaptation setting. A self-training pipeline with dynamic thresholds at the landmark level is proposed. It also uses adversarial training to learn domain-invariant features. A transformer architecture is also exploited to boost the performance of landmark detection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • To mitigate the domain gap in landmark detection, it proposes self-training with dynamic thresholds at the landmark level instead of the image level.
    • It also uses a transformer architecture to boost the performance of landmark detection.
    • The paper is well-organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The landmark detection module contains global localization, which adopts a transformer architecture. It would be better to also include an ablation study on it in Table 2.
    • In the self-training part, is the same network reused, or is a new network trained on the pseudo-labeled target domain images together with the labeled source data?
    • Technically, the modules (domain adversarial learning, self-training, and dynamic thresholds for the pseudo-labels) have already been explored in previous works. Self-training at the landmark level may be interesting.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper has provided most of the details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be better to provide more details on local refinement. What is the architecture of this part? How are f and the coordinates taken from the global localization part?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper addresses anatomical landmark detection in the UDA setting. The overall technical novelty is marginal. The specific design for detecting landmarks and the self-training at the landmark level may be interesting. At the current stage, I am actually on the borderline.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    After going through comments from other reviewers, especially R3, I still have concerns about the limited novelty. Domain adversarial learning, self-training framework and dynamic threshold have been explored before. Although they are combined and applied in the specific field of landmark detection, showing a superior performance, the novelty remains in doubt.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed a novel domain adaptive landmark detection task and a practical solution. The effectiveness is shown on two public datasets.

    It received mixed ratings from the reviewers. In the rebuttal, the authors should clarify some technical details, e.g., how exactly the confidence is inferred from the score maps, local refinement, and dynamic thresholds. In addition, it seems that self-training with a binary mask for a regression task has also been published in [a].

    [a] Generative self-training for cross-domain unsupervised tagged-to-cine MRI synthesis. In MICCAI 2021.




Author Feedback

We appreciate the valuable comments of all the reviewers. As agreed by the reviewers, our work establishes the first medical landmark detection benchmark in UDA and the proposed landmark-aware self-training (LAST) is an interesting idea. However, there are also concerns and suggestions, which we address individually as follows.

Reviewer #2: Q1. More UDA papers. We will discuss them in the next version. We also highlight the key difference between our LAST and the method from [a]: LAST applies curriculum learning to landmark-level selection for fine-grained learning, while [a] focuses on the image level.

Q2. Combine AT and DAL. AT actually already uses both self-training and DAL. Thus, combining AT and DAL will not improve the results, whereas combining LAST and DAL yields a further improvement and outperforms AT.

Q3. Confidence score. Confidence estimation methods are indeed alternatives to our base model. However, more investigations are needed to verify their effectiveness in this task.

Q4. Technical details. a) CycleGAN has already been compared, as it is a module of UMT in Table 1. b) For how the confidence is obtained from the score maps, please refer to Q3 of Reviewer #4. c) Models are randomly initialized after each self-training round. d) The batches are randomly sampled, since we already ensure that the numbers of source and target domain samples are equal. e) It is indeed interesting to see the differences between the 7 subdomains. We will examine this in the next version. f) Yes, all comparison methods were based on the same architecture.

Reviewer #3 Q1. More ablation study. The MRE and SDRs of vanilla self-training (ST), of ST with the confidence mask added (ST+Mask), and of ST with the dynamic threshold further added (ST+Mask+Dynamic) are:

Method            MRE    SDR 2mm   SDR 2.5mm   SDR 3mm   SDR 4mm
ST                2.18   62.18     69.44       75.47     84.36
ST+Mask           1.98   65.34     72.53       78.03     86.11
ST+Mask+Dynamic   1.91   66.21     74.39       80.23     88.42

The MRE at each self-training round, for ST and for LAST (i.e., ST+Mask+Dynamic), is:

Method   iter1   iter2   iter3   iter4   iter5
ST       2.31    2.20    2.19    2.18    2.18
LAST     2.10    1.97    1.94    1.92    1.91
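For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how mean radial error (MRE) and success detection rate (SDR) are commonly computed. It is not taken from the authors' code. The function name `mre_and_sdr`, the pooling over all landmarks and images, the inclusive `<=` comparison, and the assumption that coordinates are already in millimetres are all illustrative choices.

```python
import numpy as np

def mre_and_sdr(pred, gt, thresholds=(2.0, 2.5, 3.0, 4.0)):
    """Compute MRE and SDR for predicted vs. ground-truth landmarks.

    pred, gt: arrays of shape (..., 2) holding (x, y) coordinates,
              assumed here to be expressed in millimetres already.
    Returns (mre, sdr) where sdr maps each threshold to a percentage.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Radial error: Euclidean distance between prediction and ground truth.
    errors = np.linalg.norm(pred - gt, axis=-1)
    mre = float(errors.mean())
    # SDR: share of landmarks whose radial error is within each threshold
    # (inclusive comparison is an assumption of this sketch).
    sdr = {t: 100.0 * float((errors <= t).mean()) for t in thresholds}
    return mre, sdr

# Toy usage with two landmarks: one exact hit and one 5 units away.
example_mre, example_sdr = mre_and_sdr([[0, 0], [3, 4]], [[0, 0], [0, 0]])
```

Under these assumptions, a lower MRE and a higher SDR at each threshold both indicate better detection, which is the direction of improvement shown in the rows above.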

Q2. Model complexity. Our model has 41M parameters and requires 139 GFLOPs when the input size is 800x640.

Q3. Results on other parts. We add experiments on lung landmark detection based on the data released by [1]. There are 94 landmarks in total for the left and right lungs. We use the 247 images from the JSRT set as the source domain and the remaining three sets as the target domain (665 images). 70% of the target domain images are unlabeled and 30% form the test set. The results (MRE followed by the SDRs) are as follows:

Base, Labeled Source:   7.55, 14.22, 20.55, 26.73, 38.73
FDA:                    6.14, 15.51, 22.01, 28.82, 42.07
UMT:                    5.82, 16.11, 22.92, 30.15, 43.87
SAC:                    5.66, 16.69, 23.59, 30.86, 45.42
AT:                     5.49, 17.28, 24.36, 32.19, 46.76
Ours:                   5.34, 18.10, 25.82, 33.27, 48.02
Base, Labeled Target:   4.52, 26.47, 35.85, 44.99, 59.96

[1] Gaggion et al. Improving anatomical plausibility in medical image segmentation via hybrid graph neural networks: applications to chest x-ray analysis. TMI, 2022.

Reviewer #4 Q1. Ablation study on transformer. The MRE and SDRs of the proposed method, and of the variant in which the base model is replaced by heatmap regression, are as follows:

Ours:              1.75, 69.15, 76.94, 82.92, 90.05
Ours w/ heatmap:   1.84, 66.45, 75.09, 81.82, 89.55

Q2. Self-training details. After each self-training round, the network is randomly initialized to avoid confirmation bias.
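The self-training procedure described in this response (re-initializing the network each round and selecting pseudo-labels per landmark) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the rebuttal does not specify the dynamic threshold rule, so the per-landmark quantile schedule below is a hypothetical stand-in in the spirit of curriculum labeling, and `make_model`, `train_fn`, and `predict_fn` are placeholder callables supplied by the caller.

```python
import numpy as np

def select_pseudo_labels(confidence, round_idx, num_rounds):
    """Return a boolean mask of landmark-level pseudo-labels to keep.

    confidence: array of shape (num_images, num_landmarks).
    The kept fraction grows linearly with the round index (assumed schedule),
    and the threshold is computed separately for every landmark.
    """
    confidence = np.asarray(confidence, dtype=float)
    keep_fraction = (round_idx + 1) / num_rounds
    # Per-landmark threshold: quantile over the images for that landmark.
    thresholds = np.quantile(confidence, 1.0 - keep_fraction, axis=0)
    return confidence >= thresholds[None, :]

def self_train(make_model, train_fn, predict_fn, source, target_images,
               num_rounds=5):
    """Self-training loop with a freshly initialized model in every round."""
    model = make_model()
    train_fn(model, source, None)  # round 0: labeled source data only
    for r in range(num_rounds):
        coords, conf = predict_fn(model, target_images)
        mask = select_pseudo_labels(conf, r, num_rounds)
        # Re-initialize so errors from earlier rounds are not carried over.
        model = make_model()
        train_fn(model, source, (target_images, coords, mask))
    return model
```

The mask has one entry per image and landmark, so an image can contribute some of its landmarks to training while its low-confidence landmarks are ignored.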

Q3. Local refinement details. Take one landmark as an example. Based on the feature map f, local refinement generates a score map f_s and an offset map f_o via convolution, both with shape 200x160 (i.e., HxW). Global localization generates a prediction (x, y). By determining which grid cell of f_s and f_o the point (x, y) falls in, we obtain three things: 1) the confidence of the prediction, by extracting the value of that cell from f_s; 2) the local offset of the prediction, by extracting the value of that cell from f_o, denoted as (x_o, y_o); 3) the center coordinate of that cell, denoted as (x_c, y_c). Finally, the refined prediction is (x_c+x_o, y_c+y_o).
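The lookup described in Q3 can be illustrated with a short sketch. It assumes details the rebuttal does not state: that (x, y) and the offsets are in input-image pixels, that the input is 800x640 (HxW) so each grid cell covers 4x4 pixels, that the offset map stores the x offset in channel 0 and the y offset in channel 1, and that a cell center lies at (index + 0.5) times the cell size. The function name `refine_landmark` is hypothetical.

```python
import numpy as np

def refine_landmark(xy, score_map, offset_map, image_size=(800, 640)):
    """Refine one global prediction using local score and offset maps.

    xy:         (x, y) from global localization, in image pixels (assumed).
    score_map:  f_s, shape (H, W), e.g. (200, 160).
    offset_map: f_o, shape (2, H, W); channel 0 = x offset, 1 = y offset
                (channel layout is an assumption of this sketch).
    Returns ((x_refined, y_refined), confidence).
    """
    h, w = score_map.shape
    img_h, img_w = image_size
    cell_h, cell_w = img_h / h, img_w / w
    # Grid cell that the global prediction falls in, clamped to the map.
    col = min(max(int(xy[0] // cell_w), 0), w - 1)
    row = min(max(int(xy[1] // cell_h), 0), h - 1)
    # 1) confidence from f_s, 2) local offset from f_o, 3) cell center.
    confidence = float(score_map[row, col])
    x_o = float(offset_map[0, row, col])
    y_o = float(offset_map[1, row, col])
    x_c = (col + 0.5) * cell_w
    y_c = (row + 0.5) * cell_h
    return (x_c + x_o, y_c + y_o), confidence

# Toy usage: a single cell carries a non-zero score and offset.
f_s = np.zeros((200, 160))
f_o = np.zeros((2, 200, 160))
f_s[10, 5] = 0.9
f_o[0, 10, 5], f_o[1, 10, 5] = 1.0, -0.5
refined_xy, conf_value = refine_landmark((21.0, 42.0), f_s, f_o)
```

The same confidence value read from f_s is what a landmark-level pseudo-label selection step could threshold on, which is why the score map serves both refinement and self-training.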




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposed a new task along with a novel solution. The rebuttal has partially addressed some concerns. The authors are encouraged to incorporate the new discussion and clarification in the final version.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studied the problem of anatomical landmark detection and proposed an unsupervised domain adaptation method for transferring a source-domain model to the target domain. The method combines two existing methods, self-training and adversarial learning, to achieve DA. The method reported better results than the baseline methods. The rebuttal provided more details about the method and more results to support the method. The discussion about its contribution and differences relative to existing methods remains limited.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work aims to address the challenge of domain shift in anatomical landmark detection by proposing a framework for unsupervised domain adaptation. The rebuttal has adequately addressed the major concerns of the three reviewers, including adding more technical details, ablation studies, and a comparison of other methods. Thus, this paper is recommended for acceptance.


