
Authors

Peter Thompson, Medical Annotation Collaborative, Daniel C. Perry, Timothy F. Cootes, Claudia Lindner

Abstract

Developmental dysplasia of the hip (DDH) and cerebral palsy (CP)-related hip migration are two of the most common orthopaedic diseases in children, each affecting around 1-2 in 1000 children. For both of these conditions, early detection is a key factor in long-term outcomes for patients. However, early signs of these conditions are often missed, and manual monitoring of routinely collected radiographs is time-consuming and susceptible to inconsistent measurement. We propose an automatic system for calculating acetabular index (AcI) and Reimer’s migration percentage (RMP) from paediatric hip radiographs. The system applies Random Forest regression-voting to fully automatically locate the landmark points necessary for the calculation of the clinical metrics. We show that the fully automatically obtained AcI and RMP measurements are in agreement with manual measurements obtained by clinical experts, and have replicated these findings in a clinical dataset. Such a system allows for the reliable and consistent monitoring of DDH and CP patients, aiming to improve patient outcomes through hip surveillance programmes.
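
A note for orientation: the abstract describes AcI and RMP only at a high level. The sketch below shows how the two metrics are conventionally derived from landmark coordinates; the function arguments are hypothetical landmark positions, and the assumption that the image x-axis is parallel to Hilgenreiner's line is a simplification, not the paper's actual procedure.

```python
import numpy as np

def acetabular_index(tri_cart, tri_cart_other, lat_acetab_edge):
    """AcI: angle (degrees) between Hilgenreiner's line (through the
    triradiate cartilages of both hips) and the acetabular roof line
    (triradiate cartilage to lateral acetabular edge of one hip)."""
    hilgenreiner = tri_cart_other - tri_cart
    roof = lat_acetab_edge - tri_cart
    cos_a = np.dot(hilgenreiner, roof) / (
        np.linalg.norm(hilgenreiner) * np.linalg.norm(roof))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def reimers_migration_percentage(head_medial, head_lateral, perkins_x):
    """RMP: percentage of the femoral head width lying lateral to
    Perkin's line. Simplification: the image x-axis is assumed parallel
    to Hilgenreiner's line, so Perkin's line is vertical at x = perkins_x."""
    head_width = head_lateral[0] - head_medial[0]
    uncovered = max(0.0, head_lateral[0] - perkins_x)
    return 100.0 * uncovered / head_width
```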

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_40

SharedIt: https://rdcu.be/cVRtq

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose an automatic method to determine two radiographic parameters for the monitoring of hip anomalies in pediatric AP hip radiographs. The authors propose to use the concept of Random Forest Regression-Voting to automatically detect anatomical landmarks for the acetabulum and femur. These landmarks are then used to determine the radiographic parameters (AcI, RMP). The proposed system was tested and validated on a clinical dataset of 200 images (400 hips).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method relies on automatically detecting anatomical landmarks, which are then used to determine the radiographic measurements. This results in a system which I believe has great potential for easy adoption by clinicians, owing to its explainability.
    Furthermore, the authors tested and validated the proposed method extensively on an independent dataset with multiple manual measurements.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The Random Forest Regression-Voting Constrained Local Models framework was previously applied to segment the proximal femur from AP hip radiographs by identifying anatomical landmarks around the femur contour. The contribution of this work lies in extending the idea to also include acetabular landmarks and in applying it to a new patient population. As such, the work makes a contribution to the field, but does not have unique novelty.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Methods are well described and/or referenced and can be reproduced. Results are not reproducible since the dataset is not publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The manuscript is very well written and easy to read and follow. I just have a few comments:

    1. I would like to bring the following publication to your attention: Xu W, Shu L, Huang C, et al. A Deep Learning Aided Diagnostic System in Assessing Developmental Dysplasia of the Hip in Pediatric Pelvic Radiographs. Frontiers in Pediatrics, 8 March 2022, Volume 9, Article 785480, doi: 10.3389/fped.2021.785480. The authors of this paper published results of a deep-learning approach to classify hip dysplasia based on radiographic parameters, including AcI. For AcI, the intraclass consistency between the automatically generated and the manually measured values was >0.75. However, in all fairness, with a publication date of March 2022, I would not consider this a “missed reference” :-) However, since it seems to be relevant to your work, I am adding it to my comments.
    2. The calculation of AcI relies heavily on a small subset of acetabulum landmarks (9,39,5,8). I therefore think it would be meaningful to add the point-to-point errors for just these selected landmarks to the evaluation of landmark detection accuracy; see the sketch after this list.
    3. Can you provide an average (or approximated) runtime for the proposed method?
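
    Computing such per-landmark statistics is straightforward once predicted and ground-truth coordinates are available. Below is a minimal Python sketch; the array shapes and landmark indices are illustrative, not taken from the paper.

```python
import numpy as np

def point_to_point_errors(pred, gt, subset):
    """Euclidean point-to-point errors for a subset of landmarks.

    pred, gt: arrays of shape (n_images, n_landmarks, 2)
    subset:   landmark indices of interest, e.g. [9, 39, 5, 8]
    """
    d = np.linalg.norm(pred[:, subset] - gt[:, subset], axis=-1)
    return {"mean": d.mean(), "median": np.median(d),
            "p90": np.percentile(d, 90), "p95": np.percentile(d, 95),
            "p99": np.percentile(d, 99)}
```
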
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is a well-written paper with a strong evaluation, but only moderate novelty.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I agree with the other two reviewers that the large deviations in the manual measurements make it difficult to judge the real performance of the algorithm. Despite this, I believe the paper will still make an interesting contribution to the MICCAI community due to its potential clinical applicability.



Review #2

  • Please describe the contribution of the paper

    The authors propose an automatic system for calculating the acetabular index (AcI) and Reimer’s migration percentage (RMP) from paediatric hip radiographs. These two clinical measures are used for the diagnosis of developmental dysplasia of the hip (DDH) and for monitoring hip migration in cerebral palsy. The approach uses the Random Forest Regression-Voting Constrained Local Models (RFRV-CLM) framework to automatically locate landmarks, which are in turn used to automatically calculate the measures. They test their approach on a challenging dataset of pelvic radiographs of children containing cases of severe disease and occlusions. They report high agreement between the measures automatically determined by their approach and those measured by clinical experts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work demonstrates significant clinical feasibility for monitoring DDH and achieves SOTA for at least one of the measures, namely RMP.

    The approach is tested on a challenging dataset which includes cases of severe disease. Furthermore, the method is tested on a replication dataset to indicate performance in the wild.

    The authors describe a reasonable effort to establish a ground-truth dataset with multiple clinician landmark annotations, and they also report on the agreement of the landmarking.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This seems to be a direct application of a known method (RFRV-CLM), which limits the novelty. The authors do not describe any of the model’s details (model training, etc.); the methods section focusses largely on experimental design.

    As the authors themselves report, there is a large spread in their ground-truth annotations.

    No comparison to other methods is provided.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    By the authors’ own admission, the dataset is lacking examples of cerebral palsy, although the RMP measure, for which SOTA is indicated by the results, is an important measure in cerebral palsy monitoring. As the approach applies a known method, the code is not provided (not applicable).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please provide some clarification on which data was used for the initial model training. Is it the data from the 50 randomly selected images? If not, it is not exactly clear what these 50 images were used for. Some clarification in the main text on which models were built from which data would be useful. No comparison to other methods is provided.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the clinical significance of the work is high, and the results of the proposed approach are compelling, this paper seems to be a direct application of a known method (RFRV-CLM), limiting the innovation. Furthermore, the authors do not describe any of the model’s details (model training, etc.); the methods section focusses largely on experimental design.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors introduce a Random-Forest-based method for measuring, on coronal radiographs, angular parameters of the pediatric hip that have clinical value.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This study is of clinical value.
    • This article takes into account the reproducibility of manual measurement in the evaluation process.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of methodological novelty
    • The evaluation is not based on appropriate metrics and results are not fairly interpreted and discussed.
    • It is impossible to reproduce this work based on the methodological details provided in this paper.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    There is no detail about the random-forest hyperparameters or the visual descriptors chosen. Therefore, this method is nearly impossible to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The notion of a replication dataset is not clear to the reviewer in this context. It looks more like a hold-out test dataset.

    • The level of expertise of the operators who annotated DataR is not described. As there is only moderate agreement between operators on this dataset, this should be discussed somewhere in the paper.

    • While the ICC provides interesting information, the reproducibility of a measurement method should be quantified with a confidence interval of reproducibility (and repeatability) when possible. See, for instance, ISO 5725-2:2019, and the sketch after this list.

    • There is no detail about the random forests’ hyperparameters and features. The authors should be more specific about their models.

    • It is not clear to me what the ground truth for AcI and RMP actually is in DataI and DataR. Is it the average of manual annotations, or the annotations of just one chosen operator?

    • There is no evidence in the paper that the manual annotations provided to the random forest lead to robust estimates of AcI and RMP.

    • The ICC values given here provide misleading information about the real performance of the algorithm in terms of agreement with manual placement. Figure 4 illustrates that there are many patients for whom the difference between the mean manual and automatic placement is over 10° for AcI, for instance. It can also be seen that the “moderate agreement” in DataR is in fact associated with a large confidence interval of reproducibility. Finally, it can also be seen in these graphs that, for similar mean manual values, the automatic algorithm can produce variable results. These points should be fairly discussed.

    • The authors claim SOTA results on RMP, but this claim is not supported by evidence in the text.
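
    Regarding the reproducibility-interval point above: one common way to attach such an interval is via Bland-Altman limits of agreement between automatic and (mean) manual measurements. The sketch below is a generic illustration, not the paper's analysis; for an ISO 5725 style figure, the reproducibility limit is roughly 2.8 times the reproducibility standard deviation (2.8 ≈ 1.96·√2).

```python
import numpy as np

def limits_of_agreement(auto, manual, z=1.96):
    """Bland-Altman agreement between automatic and manual values.

    Returns the mean difference (bias) and the interval bias ± z*SD,
    within which ~95% of differences are expected to fall."""
    diff = np.asarray(auto, dtype=float) - np.asarray(manual, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - z * sd, bias + z * sd)
```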

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While such a method could have a significant clinical impact, it is not fairly evaluated and discussed. Even if the model is not novel, more methodological details should be provided to the reader.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors have made some useful clarifications. However, it is still unclear to me how the ground truth for the training data was generated. It is stated in the rebuttal that “The DataI ground truth point positions used in the RFRV-CLM training were created independently of the manual clinical measurements.” Given the poor reproducibility of manual placement, it is essential to make clear how the annotations were done. The quality of the ground-truth measurement is essential for the performance of the algorithm, and Fig. 4 seems to demonstrate that the algorithm is mimicking human reproducibility error.

    The reviewer is still troubled by the fact that the RF would not be described in more detail in an updated version and that details would only be provided as supplementary material. Designing an RF is a difficult task, and showing the community how to implement a successful one for this specific purpose should, in the reviewer’s opinion, be central.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviewers have mixed opinions on this paper. On the one hand, the paper presents an application of a previously introduced method, i.e., the Random Forest Regression-Voting Constrained Local Models framework, to the estimation of two radiographic parameters for the monitoring of hip anomalies in pediatric anterior-posterior (AP) hip radiographs. On the other hand, the methodological contribution is rather modest and there are concerns about the experimental design. In the rebuttal, the authors are encouraged to address the following limitations:

    • As pointed out by reviewers, the calculation of AcI relies heavily on a small subset of acetabulum landmarks. Thus, it is important to investigate the detection errors on those critical landmarks by adding point-to-point errors and to investigate the influence of such errors on the overall performance.
    • There are other landmark detection algorithms introduced in the literature, specifically designed for detecting landmarks from AP hip radiographs. Thus, it is important to add an additional investigation comparing the method used in this study to other state-of-the-art methods.
    • Details on how to train the model should be clearly presented.
    • How to determine the ground truth should be clearly presented.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

We thank the reviewers for their constructive comments.

Comparison to alternative landmark detection algorithms [R2,MR]: The focus of this work is on the application of landmark detection to address a clinical need, i.e., the automation of measurements required to assess hip disorders in children. The paper presents the initial results of a prototype to be translated into clinical practice. Our experiments show that using the manually placed landmarks does not improve the measurement performance (results will be added to the supplementary material). In light of this, and given the page limitations, we consider a comparison to alternative landmark detection algorithms beyond scope. Further, much of the existing literature on detecting landmarks from AP hip radiographs concerns adult hips and so may not be directly applicable to paediatrics.

Landmark detection performance of critical landmarks [R1,MR]: The point-to-point errors of the points used for measuring RMP and AcI are: AcI — mean: 2.47, median: 1.98, 90th percentile: 4.89, 95th: 5.76, 99th: 7.89; RMP — mean: 2.85, median: 1.93, 90th percentile: 5.21, 95th: 7.23, 99th: 11.4. These are similar to the overall results and suggest good performance with some outliers. We will add these to Table 2.

Datasets and training [R2,R3,MR]: All DataI images (n=449) were used in 3-fold cross-validation experiments to assess the RFRV-CLM point placement performance and to get automatic point positions for DataI. In parallel, a random subset of 50 images of DataI was manually measured by 9 clinicians. The DataI ground truth point positions used in the RFRV-CLM training were created independently of the manual clinical measurements. The reported landmark detection performance is based on all DataI images whereas any DataI measurement results include the subset of 50 images only. The DataR measurement results are based on point placements acquired using a RFRV-CLM model trained on all DataI images. We will clarify the above in the text.
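
For illustration, the 3-fold cross-validation protocol described above can be sketched as follows; train_fn and detect_fn are hypothetical stand-ins for the RFRV-CLM training and search steps, not the authors' actual interface.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_points(images, gt_points, train_fn, detect_fn):
    """3-fold CV: every image receives automatic point positions from a
    model that never saw it during training.

    images:    array of shape (n_images, ...)
    gt_points: array of shape (n_images, n_landmarks, 2)
    """
    auto_points = np.empty_like(gt_points)
    kf = KFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(images):
        model = train_fn(images[train_idx], gt_points[train_idx])
        auto_points[test_idx] = detect_fn(model, images[test_idx])
    return auto_points
```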

Spread and quality of the “ground truth” measurements [R2,R3,MR]: The measurements in question were made by clinicians to provide a baseline for comparison with our automated method. Although, as stated in the text, a large spread makes evaluation challenging, it is unclear why this should be a point of criticism per se. If anything, it emphasises the value of an automated system which can give consistent results. The DataR clinicians were trainees of a similar level of expertise to the DataI trainees. This will be clarified in the text.

Replication dataset [R3]: The replication data was acquired separately at a later date from a different clinical source. This should give the best possible idea of how the method would perform in practice.

Reproducibility [R3]: The system with the RFRV-CLM model trained on all DataI will be made available on publication and a link will be added to the non-anonymised final version of the paper. This will enable researchers to generate the measurements for any available dataset. We will make the hyperparameters available in the supplementary material. The image features are described in the referenced literature.

Appropriateness of ICC [R3]: We note that the ICC is commonly used in the clinical literature to assess the repeatability of radiographic measurements. The ICC 95% CIs are RMP: [0.87, 0.95] / AcI: [0.78, 0.91] for DataI, and RMP: [0.87, 0.92] / AcI: [0.84, 0.88] for DataR. The ICC values given in Table 3 were calculated using the automatic results and the mean manual measurements. The above will be included and clarified in the paper. Further, as averaging the manual measurements removes information on the spread, we also compare the performance of the system to that of the clinicians by fitting a linear mixed-effects model to all manual measurements. This confirms that there is no bias when deriving the measurements automatically.
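
For reference, ICC values with 95% CIs and a mixed-effects bias check of the kind described above can be computed with standard tools. The sketch below assumes a hypothetical long-format table with columns subject, rater and value, where rater covers the clinicians plus an "auto" entry for the system; it is an illustration, not the authors' code.

```python
import pandas as pd
import pingouin as pg
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per (subject, rater) measurement.
df = pd.read_csv("measurements.csv")  # columns: subject, rater, value

# ICC across raters, reported with its 95% confidence interval.
icc = pg.intraclass_corr(data=df, targets="subject",
                         raters="rater", ratings="value")
print(icc[["Type", "ICC", "CI95%"]])

# Mixed-effects model: fixed effect for measurement method (manual vs
# automatic), random intercept per subject. A method coefficient close
# to zero indicates no systematic bias in the automatic measurements.
df["method"] = (df["rater"] == "auto").map({True: "auto", False: "manual"})
fit = smf.mixedlm("value ~ method", df, groups=df["subject"]).fit()
print(fit.summary())
```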




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In my opinion, the authors’ rebuttal appropriately addressed the reviewers’ concerns.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The clinical applicability of the current manuscript is of great interest; however, some critical points in the rebuttal are not convincingly argued. For example, a comparison with other methods is important to highlight the method’s advantages, and the authors’ reply that “Our experiments show that using the manually placed landmarks does not improve the measurement performance” is not solid proof. Additionally, ground truth generation is not well addressed (I agree with R3.20 and R1.20).

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    14



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I think the rebuttal is reasonably good. Although there are promises, such as results to be added to the supplementary material and Table 2, the results are not surprising; therefore, I support the acceptance of this paper. The minor weaknesses are not strong enough grounds to reject this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


