
Authors

Jie Luo, Guangshen Ma, Nazim Haouchine, Zhe Xu, Yixin Wang, Tina Kapur, Lipeng Ning, William M. Wells III, Sarah Frisken

Abstract

Current registration evaluations typically compute target registration error using manually annotated datasets. As a result, the quality of landmark annotations is crucial for unbiased comparisons. Even though some data providers claim to have mitigated inter-observer variability by using multiple raters, quality control, such as third-party screening, can still be reassuring for intended users. Examining the landmark quality of neurosurgical datasets (RESECT and BITE) poses specific challenges. In this study, we applied the variogram, a tool used extensively in geostatistics, to convert 3D landmark distributions into an intuitive 2D representation. This allowed us to identify potentially problematic cases efficiently so that they could be examined by experienced radiologists. In both the RESECT and BITE datasets, we identified and confirmed a small number of landmarks with potential localization errors and found that, in some cases, the landmark distribution was not ideal for an unbiased assessment of non-rigid registration errors. In the Discussion, we provide constructive suggestions for improving the utility of publicly available annotated data.
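
For illustration only, the sketch below shows one common way to turn paired 3D landmarks into such a 2D representation: an empirical variogram cloud that plots, for every pair of landmarks, their separation distance against half the squared difference of their displacement vectors. This is a minimal sketch under our own assumptions (the function name variogram_cloud and the use of NumPy are hypothetical), not the authors' implementation.

    import numpy as np

    def variogram_cloud(fixed_pts, moving_pts):
        """Empirical variogram cloud for paired landmarks (illustrative sketch).

        fixed_pts, moving_pts: (N, 3) arrays of corresponding landmark
        coordinates in the two images of a registration pair.
        """
        disp = moving_pts - fixed_pts          # displacement vector of each landmark
        n = len(fixed_pts)
        lags, gammas = [], []
        for i in range(n):
            for j in range(i + 1, n):
                h = np.linalg.norm(fixed_pts[i] - fixed_pts[j])   # separation distance
                g = 0.5 * np.sum((disp[i] - disp[j]) ** 2)        # semivariance-like term
                lags.append(h)
                gammas.append(g)
        return np.array(lags), np.array(gammas)

    # Usage: scatter-plot lag distance against the semivariance terms;
    # points that sit far from the overall trend are plausible candidates
    # for manual inspection.
    # h, g = variogram_cloud(fixed_landmarks, moving_landmarks)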

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_4

SharedIt: https://rdcu.be/cVRSK

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper identifies the very important issue of quality control for landmark points in public data sets used for the evaluation of the registration algorithms that underpin much of MICCAI. The paper describes a methodology to evaluate landmark point quality and demonstrates feasibility on two public data sets. The basic hypothesis is that variograms can be used to quickly identify suspect landmark points in the data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Quality control and the ability to independently evaluate quality are absolutely key to the deployment of public data sets for registration evaluation. This paper proposes and evaluates a novel method to assess fiducial point quality using variograms.

    I am not aware of prior use of this methodology, nor am I aware of other practical methods for the evaluation of fiducial point quality. The methodology described in the paper (with some exceptions, see weaknesses) could be applied to most data sets used for registration evaluation, and therefore is potentially a very valuable contribution to the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper has two significant weaknesses that I believe could be addressed quite easily.

    1. The paper does not adequately evaluate the authors’ key hypothesis that variograms can be used to rapidly identify suspect landmark points. The authors classify points into three groups (definitely suspect, maybe suspect, and OK). Samples of these points are then given to two independent radiographers. To test the hypothesis, the paper needs to show that there is a statistically (and practically) significant difference in the number of suspect points confirmed by the radiographers in these samples. At present, the paper does not show this.

    2. The first part of the point classification algorithm (construction of the variograms) is well described and I am confident it could be reproduced. The second part of the algorithm (classification of fiducial points from the variograms) is less well described, and I am not confident it could be reproduced. More details need to be given on the skills and training of the people doing the classification. More quantitative data could be given on this process.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper uses publicly available data sets. The authors do not provide the code required to construct the variograms; however, their description of the algorithm used is sufficient for this. I am not confident of the reproducibility of the remainder of the classification process. As described, it sounds like it may be quite operator-dependent. I think the authors need to better formalise this part of the algorithm.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Quality control and reproducibility of data used for registration evaluation is of vital importance to the MICCAI community. I believe this paper has the potential to make a very significant contribution to the field. At present, the paper has a very significant weakness in that it does not properly test the key hypothesis; however, if the authors address this weakness, I would change my recommendation to strong accept. If the paper included further data on reproducibility and accompanying software to enable deployment on other data, I would make an even stronger recommendation.

    As detailed in my answer to question 5, the paper needs to be slightly restructured and more experimental data added to properly test the hypothesis that variograms provide a reliable way to evaluate the quality of fiducial points used in these data. The authors need to provide a three-arm experiment showing that points identified as suspect are identified correctly, i.e. that there are significantly more confirmed suspect points in this sample than in the samples of points identified as OK or potentially suspect.

    Further to that the authors should provide a more formal description of the algorithms used to classify points from the variograms so that the work could be reproduced or applied to other data.

    Other comments:

    1. Page 1, para 2: “FRE are uncorrelated [8], in practice, TRE is approximated by FRE, and these two terms are often used interchangeably.” You can’t claim this without some evidence. If people are using the terms interchangeably they are wrong and need to be put right. Putting this sentence in your paper risks people citing your paper as evidence that TRE and FRE are equivalent. I think you should just delete the second part of the sentence, or say something like “however a surprising number of researchers fail to understand this.”

    2. Page 2, para 1: “they may contain FLE”. I think you should reword this. The source of errors of localising a target and a fiducial are not necessarily related. I think you should try something like “they will contain a localisation error, i.e. the TRE is itself only an estimate of the true error”

    3. Page 2, para 6: Maybe a bit more on why you’ve chosen the variogram over other methods? What other methods are there?

    4. Figures 2 to 6 encode really important information using only differences in colour. Please consider using non-colour cues (i.e. different shapes like squares and crosses) and/or a colour-blind-friendly palette (https://doi.org/10.1038/nmeth.1618)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a significant weakness, however I think this could be very easily addressed during the revision and rebuttal stages.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors haven’t addressed my main concern in their rebuttal; however, they have explained why they don’t think they can address it. This is a shame as I really do think the paper would be much better with a proper proof of the hypothesis. I disagree that they would have to check all 700+ landmark pairs to do this. It could be done by randomly sampling a statistically significant portion of each set of landmark points. That said, the paper proposes a novel solution to a very important problem and after revision should not have any significant errors, so I have revised my opinion slightly.



Review #2

  • Please describe the contribution of the paper

    This paper presents a new, efficient approach to testing the quality of image fiducials in (medical) image data sets. By applying the variogram from the geosciences, the approach is quick, reliable, and easy to use and interpret.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A well-written contribution that introduces the method to a new field and applies it to the use case of fiducial/landmark selection within a clinical environment.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the methods section, the pairs of image data sets are to be registered. Right? If so, the process of registration will affect the methodology here, as the quality of the registration depends on the choice of the landmarks. A vicious circle. If the pairs are not registered, how are the landmarks compared?

    The work of Bardosi et al. on FLE_image (IJCARS) presents very relevant information for this type of research and should have been addressed.

    This method is more or less a qualitative approach to testing the “quality” of selected fiducials that is based on the assumption (end of Section 2.1) that more widely separated landmarks have larger differences in their displacement. First, this is not backed by any reference, and second, it is counter-intuitive, as it would imply a “bias” proportional to the distance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Data will be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please address issues in point 5.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This contribution provides a novel approach without really assessing the current state of the art in assessing image fiducials, builds on unfounded assumptions, and is somewhat unclear/imprecise in the description of the methods.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors propose a method to assess the quality of paired landmarks that were manually labeled in corresponding images of registration datasets. Variograms, an existing statistical tool, are used to create 2D representations of 3D landmark distributions. These 2D variograms can be easily checked by an operator to identify potentially problematic landmark pairs, as cases with localization errors yield specific patterns. The method is applied to two open-source datasets of MR and intraoperative US images of the brain. Among more than 700 landmark pairs, 29 were identified as potentially problematic. After a review by third-party clinicians, the poor quality of these landmark pairs was indeed confirmed. The proposed method is thus an interesting tool to check/improve the annotations of publicly available datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The main strength of the paper is its topic itself: enhancing the quality of medical annotations. This topic is rarely addressed but is of prime importance, as annotations are extensively used for training or evaluation of methods in both the MIC and CAI communities.
    • The “specificity” appears good: most (if not all? see details below) variograms tagged as problematic correspond to landmarks of poor quality. Once identified, these landmarks could thus be improved.
    • The method is sound and clear, and several variogram patterns are clearly described (isolated landmarks, clusters, …).
    • While not fully automatic, the proposed tool clearly speeds up the quality control process.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main weakness is that the “sensitivity” of the method is unknown. In particular, how many landmarks of poor quality are not identified through the variograms (missed cases)? It is difficult to assess this, since all landmarks would have to be thoroughly checked by third parties (or maybe by looking at the most challenging cases in challenges like [30]?). Nevertheless, this limitation should at least be discussed in the paper.
    • There might be other patterns in the variograms that were not identified.
    • It would have been interesting to list (some) other datasets that could be directly checked with your method.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is clearly detailed and could be easily reproduced. Providing the code would be a plus, so that future dataset providers could use it while checking their annotations.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • After review, the average scores [1-4] were: Cat 1 (problematic): 1.4; Cat 2 (atypical): 2.4; Cat 3 (normal): ?. For the problematic pairs, the average score of 1.4 (on the scale [1 (poor), 2 (questionable), 3 (acceptable), 4 (good)]) generally confirms the variogram category. -> How many problematic landmarks were eventually noted 3 or 4, if any?

    • “Both datasets have manually annotated corresponding landmarks on pre-operative Magnetic Resonance (p-MR) and intra-operative Ultrasound (i-US) images [20, 30].” -> Since the landmarks are part of the datasets themselves, better to cite the original papers here [16, 29]? The other papers mostly stress the use and importance of the annotations.

    • In the end, fewer than 5% of all landmarks were of poor quality. These landmarks have to be improved, but this low figure remains good news!

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factor is the importance of the topic and the proposed method to address it.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes and evaluates a novel method to evaluate fiducial point quality using variograms. The method seems qualitative and biased towards distance, as noted by a reviewer. It addresses an important problem in making public datasets. Will the authors make source code public? The authors should provide more details of the algorithm used to classify points from the variograms so that the work can be reproduced.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We are glad that all reviewers found our work novel. They pointed out that this work “identifies the very important issue of quality control for landmark points in public registration data sets that underpin much of MICCAI” (R1) and that “This topic is rarely discussed but is of prime importance both in the MIC and CAI communities” (R3). They think the proposed method is “quick, reliable and easy to use” (R2, R3).

We are also delighted to announce that we will provide all outcomes as open and reproducible research on GitHub for the community during the conference, to spur further development in this direction.

We appreciate the careful and thoughtful reviews and the many helpful suggestions, and will address the reviewers’ major concerns in the following:

Q1: R1 is interested in “whether there is a statistically significant difference in the number of suspect points confirmed by the radiographers in these samples.”

We thank R1 for pointing this out and agree that this is a very important aspect, just like the “sensitivity”, for evaluating the effectiveness of a method. As noted by R3, to assess this, all 700+ landmark pairs would have to be thoroughly checked by (ideally) multiple expert raters. As this would take several weeks of full-time work, we were unable to complete it during the rebuttal period. However, we will add a discussion of this limitation of the variogram method in the final version and conduct more experiments in future work.

Q2: R2 recommended a relevant paper (by Bardosi et al) as a reference.

The paper “Estimating FLE distributions of manual fiducial localization in CT images” suggests using the sample-mean of multiple (9 in the paper) annotators as the approximator for the true FLE. We think this is an excellent choice for building high-quality landmark-based data sets. We will add a discussion in the final version.
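
As a rough sketch of that idea in our own notation (not a formula taken from either paper): if x_i^{(k)} denotes the localization of landmark i by annotator k = 1, ..., K, then

    \hat{x}_i = \frac{1}{K} \sum_{k=1}^{K} x_i^{(k)},
    \qquad \widehat{\mathrm{FLE}}_i^{(k)} = x_i^{(k)} - \hat{x}_i,

and the spread of the \widehat{\mathrm{FLE}}_i^{(k)} across annotators gives an empirical estimate of the FLE distribution, with \hat{x}_i serving as the approximator of the true landmark location.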

Q3: R2 asked “Do the pairs of image data sets have to be registered?”

To use the variogram, the two images in a pair do not have to be registered. Users only need to input the coordinates of the corresponding landmarks.

Q4: R2 commented that “The method is based on the assumption that wider separated landmarks have larger differences in their displacement. It is not backed by any reference and it is counter-intuitive as it would imply a ‘bias’ proportional to the distance.”

The original description of the assumption (last paragraph of Section 2.1) is that “landmarks that are close to each other typically have smaller differences in their displacement vectors than landmarks that are further apart.” We believe that, if understood correctly, this is different from R2’s reading that “wider separated landmarks have larger differences in their displacement.” The assumption is well established and widely used in the geoscience community, and does not imply a “bias” proportional to the distance. We will add some citations in the final version.
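
For context, the standard empirical semivariogram from geostatistics (our own restatement, not a formula quoted from the paper) makes this assumption precise. With x_i the position of landmark i and d_i its displacement vector,

    \gamma(h) = \frac{1}{2\,\lvert N(h) \rvert} \sum_{(i,j) \in N(h)} \lVert d_i - d_j \rVert^2,
    \qquad N(h) = \{ (i,j) : \lVert x_i - x_j \rVert \approx h \},

so \gamma(h) describes how the expected dissimilarity of displacement vectors grows with the separation h, typically rising toward a sill under spatial correlation; it is a statement about correlation decaying with distance, not a bias that grows with distance.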

Q5: R2 commented that “The method is more or less qualitative”.

In clinical practice, the most prevalent method for examining the quality of landmarks is visual inspection, which is qualitative. The advantage of the proposed method is its speed, as it accelerates the quality control process. We also believe that this work will motivate others to develop more robust (quantitative) methods to examine the quality of data sets.

All other comments will be carefully addressed in the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a novel method for evaluation of landmarks using variograms. All the reviewers see novelty in the approach and converged on the view that this paper can provide an interesting discussion point, although some moderate weaknesses remain. For one rebuttal point, the authors mention that they do not have sufficient time; in my opinion this alone would not make a case for acceptance, so authors need to be careful in their rebuttals. Notwithstanding that, the reviewers all see value in the paper and have all converged to an accept decision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In general, all reviewers see value in the proposed method to evaluate the reliability of manual landmark selection. There are some critical suggestions, e.g. reviewer 1’s suggestion to re-check at least a sub-sampled set of landmarks for evaluation, which remain unresolved. Nevertheless, I follow the majority opinion and think the paper serves as an interesting discussion point.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Interesting and thought-provoking work.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


