
Authors

Heqin Zhu, Quan Quan, Qingsong Yao, Zaiyi Liu, S. Kevin Zhou

Abstract

One-shot medical landmark detection has gained much attention and achieved great success for its label-efficient training process. However, existing one-shot learning methods are highly specialized in a single domain and suffer heavily from domain preference when given multi-domain unlabeled data. Moreover, one-shot learning is not robust: its performance drops when a sub-optimal image is annotated. To tackle these issues, we develop a domain-adaptive one-shot landmark detection framework for handling multi-domain medical images, named Universal One-shot Detection (UOD). UOD consists of two stages and two corresponding universal models, each designed as a combination of domain-specific modules and domain-shared modules. In the first stage, a domain-adaptive convolution model is trained with self-supervision to generate pseudo landmark labels. In the second stage, we design a domain-adaptive transformer to eliminate domain preference and build global context for multi-domain data. Even though only one annotated sample from each domain is available for training, the domain-shared modules help UOD aggregate all one-shot samples to detect landmarks more robustly and accurately. We investigated the proposed UOD both qualitatively and quantitatively on three widely used public X-ray datasets from different anatomical domains (i.e., head, hand, chest) and obtained state-of-the-art performance in each domain. The code is at https://github.com/heqin-zhu/UOD_universal_oneshot_detection.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_3

SharedIt: https://rdcu.be/dnwb0

Link to the code repository

https://github.com/heqin-zhu/UOD_universal_oneshot_detection

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    Authors present a two-stage network designed to solve multi-domain one-shot landmark detection problems. Network I consists of a self-supervised (contrastive) model that, once trained, provides pseudo labels for Network II. Network I follows the work of Yao et al. Network II also follows the work of Yao et al., but with some modifications. Network II is a supervised network. In the work of Yao et al., Network II consists of a multi-task U-Net, which, in this manuscript, is swapped with a multi-task multi-resolution Transformer network. The novel technical contribution of this work is the DATB. In brief, the DATB swaps the standard query network in each transformer block with a ‘multi-domain’ query network (essentially N copies of the query network, where N is the number of domains). Additionally, the outputs of the QKV and MLP operations in the transformer are rescaled before their residual additions, following the ‘LayerScale’ design (which allows deeper transformers to be trained). A minimal sketch of such a block appears below.
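    To make the described design concrete, here is a minimal single-head PyTorch-style sketch of such a block. All names, dimensions, and hyperparameters are illustrative assumptions based on the description above, not the authors' implementation; e.g., one might instantiate it as `DomainAdaptiveBlock(dim=96, num_domains=3)` for the three domains (head, hand, chest).

```python
import torch
import torch.nn as nn

class DomainAdaptiveBlock(nn.Module):
    """Sketch of a DATB-like block: per-domain query projections, shared
    key/value projections, and LayerScale-style learnable residual scaling.
    Assumed layout for illustration, not the authors' code."""

    def __init__(self, dim: int, num_domains: int, init_scale: float = 1e-4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # One query projection per domain; key/value projections are shared.
        self.q_proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_domains))
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # LayerScale: learnable per-channel scaling of each residual branch.
        self.gamma1 = nn.Parameter(init_scale * torch.ones(dim))
        self.gamma2 = nn.Parameter(init_scale * torch.ones(dim))

    def forward(self, x: torch.Tensor, domain_idx: int) -> torch.Tensor:
        # x: (batch, tokens, dim); domain_idx selects the domain-specific query.
        h = self.norm1(x)
        q = self.q_proj[domain_idx](h)           # domain-specific query
        k, v = self.kv_proj(h).chunk(2, dim=-1)  # domain-shared key/value
        attn = torch.softmax(q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5, dim=-1)
        x = x + self.gamma1 * (attn @ v)         # LayerScale after attention
        x = x + self.gamma2 * self.mlp(self.norm2(x))  # LayerScale after MLP
        return x
```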

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • concise and well-written work
    • a good amount of detail is provided on every aspect of the implementation (except for the “init heatmap” description, but I suspect lack of space)
    • sound results; strong improvement over the multi-task U-Net (aka CC2D, aka the work of Yao et al.). I liked the ablation study too. It is clear that the multi-domain QKV (aka MSA_{Q_d}) is the key element that brings the transformer performance to a decent level. But it is the addition of ‘LayerScale’ that most likely allows the network to then surpass the multi-task U-Net (given the incremental improvement).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the authors do not make it clear in the paper that they follow the work of Yao et al. almost identically, except for the Transformer architecture (which replaces the multi-task U-Net). Please correct me if I misunderstood (I did not read Yao et al. in full, but that is my understanding from skimming that paper). This then raises the question of why the DATB-adapted multi-stage Transformer performs better than the multi-task U-Net in the work of Yao et al.
    • missing information: it is not clear how inference is performed on patches or what the dimensionality of the inputs is during inference (while not required, a supplementary could have easily contained all this necessary information).
    • no clinical feasibility / applicability discussed.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As mentioned above, the paper is well written, particularly in terms of reproducibility. I would ask the authors to please share the code in the rebuttal via an anonymous GitHub account, as good practice.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • as per the major weakness above, you need to make it very clear to the reader what your contribution is here over Yao et al. [22]. It only became clear to me after a thorough pass through the paper that the main technical contribution is DATR (with the main focus on DATB), with all other variables, including the datasets, held constant. You need to say this explicitly.
    • you need to make it clear, either in the title or the abstract, that this work only verifies results on X-ray datasets. It hasn’t been tested against other domains.
    • In “Overall pipeline” on page 5, I did not understand why you need to set the initial heatmap with a Gaussian. In fact, this entire paragraph is a little confusing. I suggest you provide a supplementary for this to make it clearer.
    • You could explain more about inference. For example, it wasn’t clear how you process data in patches. Secondly, I wasn’t sure whether one would always need to provide ‘3 images’ from 3 different domains as input to the transformer. How does this work? Could you clarify? (again, in a supplementary if lacking space)
    • how did you decide on the number of stages for DATR? Do you have any preliminary hyperparameter tuning results?
    • you should ideally show the performance of Network I (without Network II), as Yao et al. did. Or direct the reader to Yao et al. to say that this kind of test had already been done on this data (CC2D-SSL in Yao et al.).
    • why do you add LayerScale in both places, after MSA and after MLP? Why not just after MLP? Do you have that in your ablation test results?
    • in 3.1, “effectiveness of universal model”: why do you think the ‘universal’ model outperforms the ‘single’, highly specialized model? How many training samples were given to the ‘single’ model? Is it a problem of a ‘smaller’ dataset?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Presents a novel design for Stage II, and justifies design choices for how it adapts transformer for multi-domain task (both in terms of description and ablation study results). Shows improved performance over prior work from Yao et al.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This study proposed a multi-domain framework for one-shot anatomical landmark detection. To the reviewer’s knowledge, the framework is the first in the medical field to involve X-ray datasets of the head, hand, and chest. The framework contains a Siamese network for domain-specific feature learning, as well as a domain-adaptive transformer (DATR) to efficiently learn the features shared across different domains so as to facilitate landmark detection in one shot.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The design of DATR is novel. When calculating the attention score, the authors did not scale/normalize the query-key dot product by the dimension of the key vector. Such a practice is usually suggested to prevent vanishing gradients during backpropagation. Instead, the authors adopted the idea of LayerScale and made it domain-aware. 2) The experiments are comprehensive, including the ablation study, and the results are strong.
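    In symbols, the contrast the reviewer is drawing might look as follows. This is a sketch of the reviewer's reading, with the domain index d and the per-domain diagonal matrix D_d as assumed notation, not equations copied from the paper:

```latex
% Standard scaled dot-product attention divides by sqrt(d_k):
%   Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
% The variant described here drops that normalization and instead rescales
% the attention output with a learnable, domain-aware diagonal matrix D_d
% before the residual addition (the LayerScale idea):
\[
  \mathrm{Attn}(Q_d, K, V) = \mathrm{softmax}\bigl(Q_d K^{\top}\bigr)\, V,
  \qquad
  x \leftarrow x + D_d \,\mathrm{Attn}(Q_d, K, V),
\]
% where D_d = diag(\lambda_{d,1}, ..., \lambda_{d,C}) is learnable.
```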

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems there is a correlation between the number of training samples per domain and the accuracy. For instance, the chest dataset has the fewest training samples and the worst result. The reviewer wonders if the authors can include more chest X-rays, or run an ablation study trained with fewer head and hand images, to see if that affects the results. In addition, different domains might have different criteria for the preciseness of landmark detection; chest landmarks may have a larger tolerance for error than head landmarks. Therefore, an evaluation metric that takes domain differences into account might be preferred and more meaningful in application.

    The following are not really weaknesses but problems which, if addressed, would make the paper better: 1) Table 1 has the best results in bold. However, fully supervised YOLO, as well as YOLO with 25 labels, apparently presents better results, and YOLO with 10 labels has competing accuracy. It is misleading there; maybe separate the table? 2) The link provided on page 6 is not working. 3) The authors use DATR and DATB (where B means block) interchangeably, but the reviewer thinks it would be better to keep the abbreviation consistent. 4) What is the sigma when evaluating SDR? 5) Table 2 could be made better if a, b, c, d were in order. Currently it is a, c, b, d.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors checked yes to all the Reproducibility Response questions. The only concern is that the chest dataset they use might not be available, since the link they provide does not work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In addition to the points in the “weakness” part, the reviewer is interested in knowing how heterogeneity within a domain would affect the model, for example, combining chest X-rays from another dataset that includes collapsed lungs, or brain images taken from a different angle.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The merits outweigh the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    o This paper proposed a two-stage framework that takes in multi-domain unlabeled data and predicts landmarks of objects (e.g., the head, the hand, and the lung). The framework is expected to be robust across multiple domains (i.e., head, hand, and chest images) and should not be biased by specific domain data.
    o The first stage of the framework uses contrastive learning to learn pseudo landmarks that are used in the second stage. In the second stage, the transformer-based method helps to capture domain-specific properties while mitigating the domain-specific bias.
    o This paper demonstrates the methods on public datasets, showing (1) that each of the proposed modules is useful and (2) that the overall framework outperforms a multi-domain model (YOLO) and a one-shot model trained in a single domain (CC2D).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    o The paper focuses on an important (one-shot/few-shot) field, i.e., how to utilize information from multiple domains in predicting landmarks. Here, one-shot samples are used as inference data (as opposed to training data with supervision) to compute the evaluation metric that indicates the “goodness” of the model. By doing so, the model is trained regardless of the quality of annotated template images.
    o The paper presents a novel method that addresses difficulties in one-shot learning from multiple domains. To this end, the paper introduces a universal model that takes advantage of common features (of landmarks) across domains. The methods are designed to extract both domain-specific and common features.
    o The modification of the transformer block is also interesting. The authors proposed the DATR (Domain-Adaptive TRansformer) module to separate domain-specific and common features across domains. The queries in the proposed transformer block are domain-specific, while the keys and values are shared across domains. Such a structure can hopefully capture both domain-specific and cross-domain features.
    o The comparison between the proposed model and the existing methods (i.e., YOLO and CC2D) suggests that the proposed model (1) learns useful information from multiple domains that improves on the performance of single-domain models (a model trained with single-domain data, and CC2D) and (2) achieves performance that the SOTA multi-domain learning model (i.e., YOLO) needs more labeled data to match.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    o The authors failed to explain why predicting landmarks in one domain can help to predict landmarks in another domain. While the experiments suggest that learning from different domains beats learning from a single domain, it is unclear whether this advantage generalizes to other multi-domain data. In particular, the authors didn’t explicitly explain/show why predicting landmarks in head images can help predict landmarks in hand images and vice versa. Because landmarks in medical applications are usually chosen at places that are of clinical interest with specific geometric properties, I am wondering whether there is any assumption on the landmarks or the selected domains (e.g., head and hand) under which the proposed framework is useful.
    o In stage 2 (Section 2.2), the design of the proposed methods is not well motivated, especially the purpose of the learnable diagonal matrices. The paper nicely pointed out that the diagonal matrices are designed “to facilitate the learning of domain-specific features”. Given that Q_d is also designed to capture domain-specific features, is D_1 in Eq. (2) redundant?
    o It is important to understand how the encoders are initialized. In the Implementation details paragraph of Section 3, could you clarify “All encoders are initialized with corresponding pre-trained weights”?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    o The paper conducted experiments on public datasets.
    o The authors will release the code upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    o To comprehensively show the advantage of the universal model over the single model, the authors could provide a plot (e.g., a box plot) that shows statistics of the MRE and SDR of all one-shot samples in each domain.
    o There may be a typo in Fig. 2(a): I am curious why the second DATB from the bottom has significantly more blocks (x 18). Also, there are a few small issues that can be quickly fixed:
      - Page 2: in the first sentence of the paragraph starting with “However,” it makes more sense to say “one-shot methods are not robust enough because they are dependent on …”.
      - In the introductory paragraph of Section 2, “to learn the local appearance and of each domain” should be “to learn the local appearance of each domain”.
    o In the caption of Table 1, the last sentence says “the best results are in bold”. Strictly, this is not true considering YOLO with more labeled data. It would be good to refine this statement.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    o This paper contributes a methodological improvement in the field of one-shot learning from multiple domains. The paper is well written. The methods and the framework are clearly illustrated. Moreover, the experiments are designed to support, and have suggested the correctness of, the central argument of this paper, i.e., that multi-domain learning helps in tasks with few labeled samples.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed domain-shared universal one-shot detection of anatomical landmarks. The methodology consists of two stages and two corresponding universal models designed as combinations of domain-specific and domain-shared modules. The experimental results demonstrated better performance, and the three reviewers affirmed the merits of this paper. The remaining issues include explaining the generalization to other multi-domain data, clarifying “All encoders are initialized with corresponding pre-trained weights”, specifying the sigma used when evaluating SDR, and detailing inference as well as other implementation details mentioned by the reviewers. Please address these concerns in the final version.




Author Feedback

We sincerely thank all reviewers and the AC for their time and effort. Per the AC’s request, we address the reviewers’ main concerns below.

Reviewer 1: Q1: Why can predicting landmarks in one domain help to predict landmarks in another domain? A1: Even though some landmarks are chosen according to clinical interest, landmarks from various domains share common knowledge, such as corner points and lightness borders. Based on this, we develop domain-shared modules to capture common features from different domains and improve the learning of domain-specific features.

Q2: Given that Q_d is also designed to capture domain-specific features, is D_1 in Eq. (2) redundant? A2: D_1 is not redundant. Q_d is employed in self-attention, while D_1 is employed before the residual connection (see the sketch below). They learn domain-specific features at different depths. Moreover, D_1 can boost the convergence of transformer networks, according to [16].
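For clarity, here is a plausible reconstruction of where each component acts, assuming a standard pre-norm transformer layout; this illustrates the answer above and is not Eq. (2) copied from the paper:

```latex
% Q_d acts inside the multi-head self-attention (MSA), while the learnable
% diagonal matrices D_1, D_2 rescale each branch output just before its
% residual connection:
\begin{aligned}
  x &\leftarrow x + D_1 \,\mathrm{MSA}_{Q_d}\!\bigl(\mathrm{LN}(x)\bigr),\\
  x &\leftarrow x + D_2 \,\mathrm{MLP}\!\bigl(\mathrm{LN}(x)\bigr).
\end{aligned}
```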

Q3: Clarify “All encoders are initialized with corresponding pre-trained weights”. A3: Thank you for your suggestion. “All encoders” refers to the VGG encoder in stage 1 and the Swin Transformer encoder in stage 2.

Reviewer 2: Q1: It seems there is a correlation between the number of training samples per domain and the accuracy. A1: Thank you for pointing out this interesting observation. In fact, the head training set has 150 images while the chest training set has 197 images. The hand dataset achieves the best accuracy, which may benefit from its large training set of 609 images. The reason the chest dataset obtains the worst accuracy may be its small number of landmarks.

Q2: An evaluation metric that takes domain differences into account might be preferred and more meaningful in application. A2: Thank you for this constructive suggestion. We will try such domain-aware metrics in future work.

Q3: Table 1 has the best results in bold. However, fully supervised YOLO, as well as YOLO with 25 labels, presents better results, and YOLO with 10 labels has competing accuracy. It is misleading there; maybe separate the table? A3: With 25 labels, the MRE of YOLO on the hand dataset is 2.88 mm, versus 7.03 mm on the chest dataset. With 10 labels, the corresponding numbers are 9.70 mm and 16.07 mm. YOLO performs better with 25 labels under most metrics, which is comparable to UOD with 1 label.

Reviewer 3: Q1: You need to make it very clear to the reader what your contribution is here over Yao et al. [22]. A1: Thank you for pointing this out. As stated in the last paragraph of the Introduction, our contributions have three parts: (1) We develop a universal model for the multi-domain one-shot scenario, while Yao et al. aim at single-domain one-shot learning. (2) We develop the domain-adaptive transformer DATR to learn domain-specific knowledge and to build global context information of multi-domain landmarks, while Yao et al. use a vanilla U-Net in stage II. (3) Our UOD surpasses CC2D on every dataset.

Q2: You could explain more about inference. A2: During training and inference, all datasets are mixed batch by batch and processed batch-wise in each epoch.

Q3: In 3.1, “effectiveness of universal model”: why do you think the ‘universal’ model outperforms the ‘single’, highly specialized model? A3: The universal model can promote the learning of domain-specific knowledge with the help of the learning of domain-shared knowledge. More experiments should be carried out to verify whether it is a problem of a ‘smaller’ dataset.

Other details:

  • The number of DATB blocks (e.g., “x 18 DATB”) is adopted from the Swin Transformer.
  • The link on page 6 is the official link of SCR for the JSRT annotations.
  • SDR is evaluated from the predicted landmarks and the ground-truth landmarks, with no need for a sigma.
  • “a, c, b, d” in Table 2 is a typo, and we will correct it.
  • An exponential is applied to the initial heatmap to obtain the final ground-truth heatmap (a generic sketch of this construction appears below). Maybe we can combine the two steps.
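As a rough illustration of what “applying an exponential to an initial heatmap” could mean, here is a common Gaussian-heatmap construction used in landmark detection. This is a generic sketch under assumed conventions (the function name, the (x, y) ordering, and sigma are all illustrative), not the authors' code:

```python
import numpy as np

def gaussian_heatmap(height, width, landmark, sigma=3.0):
    """Ground-truth heatmap for one landmark: an 'initial' squared-distance
    map passed through an exponential. A common convention, assumed here;
    not necessarily the exact construction used in the paper."""
    ys, xs = np.mgrid[0:height, 0:width]
    lx, ly = landmark                        # landmark as (x, y) pixel coords
    d2 = (xs - lx) ** 2 + (ys - ly) ** 2     # initial map: squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))  # exponential -> Gaussian peak

# Example: a 256x256 heatmap peaked at pixel (100, 120).
hm = gaussian_heatmap(256, 256, (100, 120))
assert abs(hm[120, 100] - 1.0) < 1e-9
```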


