
Authors

Micha Kornreich, JinHyeong Park, Joschka Braun, Jayashri Pawar, James Browning, Richard Herzog, Benjamin Odry, Li Zhang

Abstract

Labeling for pathology detection is a laborious task, performed by highly trained and expensive experts. Datasets often have mixed formats, including a mix of pathology positional labels and categorical labels. Successfully combining mixed-format data from multiple institutions for model training and evaluation is critical for model generalization. Herein, we describe a novel machine-learning method to augment a categorical dataset with positional information. This is inspired by the emerging data-centric AI paradigm, which focuses on systematically changing data to improve performance, rather than changing the model. In order to improve on a baseline of reducing the positional labels to categorical data, we propose a generalizable two-stage method that directs model attention to regions where pathologies are highly likely to occur, exploiting all the mixed-format data. The proposed approach was evaluated using four different knee MRI pathology detection tasks, including anterior cruciate ligament (ACL) integrity and injury age (5082 cases), and medial compartment cartilage (MCC) high-grade defects and subchondral edema detection (4251 cases). For these tasks, we achieved specificities of 90-94% and sensitivities of 78-93%, which were comparable to the inter-reader agreement results. On all tasks, we report an increase in AUC score, with average improvements of 8% in specificity and 4% in sensitivity over the baseline approach. Combining a UNet network with a morphological peak-finding algorithm, our method also provides defect localization, with average accuracies of 4.3-5.1 mm. In addition, we demonstrate that our model generalizes well on a publicly available ACL tear dataset of 717 cases, without re-training, achieving 90% specificity and 100% sensitivity. The proposed method can be used to optimize image classification tasks in other medical or non-medical domains, which often have a mixture of categorical and positional labels.
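No code repository accompanies the paper, so as a rough illustration of the localization step the abstract describes (a UNet probability map followed by morphological peak finding), here is a minimal sketch in Python/SciPy. It is not the authors' implementation; all names and parameters (`find_defect_candidates`, `threshold`, `min_distance`) are hypothetical.

```python
import numpy as np
from scipy import ndimage

def find_defect_candidates(prob_map: np.ndarray,
                           threshold: float = 0.5,
                           min_distance: int = 5) -> np.ndarray:
    """Return voxel coordinates of candidate defect peaks in a
    UNet probability map, sorted by descending probability.

    prob_map: 3D array of per-voxel defect probabilities (UNet output).
    threshold: minimum probability for a voxel to count as a candidate
        (hypothetical value, not from the paper).
    min_distance: neighborhood size in voxels for the morphological
        maximum filter; nearby peaks within this window are merged.
    """
    # A voxel is a peak if it equals the maximum of its neighborhood
    # (morphological maximum filter) and exceeds the threshold.
    local_max = ndimage.maximum_filter(prob_map, size=min_distance)
    peaks = (prob_map == local_max) & (prob_map > threshold)
    coords = np.argwhere(peaks)
    # Rank candidates so the "best" (highest-probability) peak comes first.
    order = np.argsort(prob_map[tuple(coords.T)])[::-1]
    return coords[order]
```

In a two-stage pipeline like the one described, the top-ranked candidates from such a step would define the regions cropped and passed to the stage II classifier.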

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_18

SharedIt: https://rdcu.be/cVRY0

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method for training neural nets from positional-class pair labels. The authors develop and validate their idea on knee MRI datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • External validation of the developed method
    • Conceptually the right direction - adding positional information is a natural way to ensure that the network learns from the relevant regions
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The approach is too heuristic, and it is unclear how the clinical relevance of the developed method can overcome this
    • The authors re-invent object detection, and should have tried single-shot object detection architectures as baselines
    • Statistical correctness of the claims has to be verified before this work is published.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is very difficult to reproduce due to (a) the private dataset and (b) the highly complex multi-stage pipeline.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Dear authors,

    thank you for submitting your work to MICCAI. While the paper overall investigates an interesting direction, there are clearly some limitations.

    • As I mentioned earlier, this paper re-invents object detection. The authors should compare to methods like SSD and YOLO. The labels can easily be converted to the required format. In my opinion, it is fairly straightforward to build a 3D implementation of YOLO.
    • Please run statistical testing on your results, i.e. compute a standard error over runs, and preferably execute a statistical test itself.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No statistical testing, no comparison to baseline methods, and poor reproducibility.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The paper presents a two-stage method for the classification of 4 abnormalities from knee MRI. The first stage performs rough defect localization; the second, binary classification. The authors focus on the data-centric aspects - the heterogeneity of the available MR images and the corresponding annotations across institutions - and study the relative importance of positional (point location) and categorical (defect grade) labels in the detection of knee abnormalities. The results are three-fold: (I) combining positional and categorical labels was shown to be beneficial for the overall performance, with the proposed automatic method reaching the level of inter-reader agreement; (II) data augmentation applied also between the stages led to a further performance increase; (III) training with diverse multi-institutional data (25 locations) yielded high performance also on an unseen public dataset obtained with a different MR protocol, which supports the importance of combining the datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. While the proposed method is essentially a two-step localization-classification pipeline with multi-task supervision, it is meticulously designed and evaluated by the authors. Consequently, the study provides a range of insights into the value of integrating the diverse annotations, performance of automatic defect localization, performance of automatic defect grading, and value of multi-institutional (multi-view, multi-protocol) MRI data. Overall, the study presents a very strong clinically-relevant evaluation.
    2. While the multi-institutional dataset is one of the core assets behind the study yet remains private, the authors make a focused effort to describe it in detail, including the protocol- and defect-related details.
    3. The article is excellently written in all its parts - problem formulation, structure, data description and preprocessing, method description, experiment design, statistical analysis, interpretation of the results, conclusions, and valuable supplemental materials. Excellent graphical materials.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The architectural novelty of the proposed solution is rather limited. The study also does not review the state-of-the-art methodological solutions for the task. Consequently, it is not clear, for example, why an end-to-end approach is not considered.
    2. The paper is missing a discussion comparing the proposed method against prior studies, at least in the scope of ACL injuries - Namiri et al 2020 (https://doi.org/10.1148/ryai.2020190207), Astuto et al 2021 (https://doi.org/10.1148/ryai.2021200165).
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. Overall, a very high reproducibility score. The checklist provided by the authors is in agreement with the provided details.
    2. A few aspects regarding the private dataset and the data management are to be clarified. See the comments in the next section.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Page 2: “even different image types”. The previous paragraph actually prioritizes the difference in the imaging protocols, so it would perhaps be better to rephrase “even” here.
    2. Page 2: “However, methods to combine categorical labels with other positional label types, such as point-landmarks, were not addressed.”. This is a false statement. A direct counter-example is Mask-RCNN (He et al); in the knee domain, KNEEL + DeepKnee (XR, Tiulpin et al) and Namiri et al 2020 (MRI, https://doi.org/10.1148/ryai.2020190207); for a broad overview of the methods, see Crawshaw et al 2020 (https://arxiv.org/abs/2009.09796). Please rephrase.
    3. Page 2: “A first model, to our knowledge, trained for ACL injury-age and subchondral edema underlying the cartilage defect pathology detection.”. Firstly, according to the paper, the model is not trained for the tasks simultaneously; rather, one model is trained per task. Secondly, please see Astuto et al 2021 (https://doi.org/10.1148/ryai.2021200165) and adjust the claims accordingly.
    4. Page 3: Please elaborate on how the splits are done - balancing w.r.t. the number of control/case samples, or something else?
    5. Page 3: “25 different institutions”. Please provide a reference to the dataset or to prior publications using it. For the camera-ready version, please strongly consider also elaborating on the demographic info (at least age range, sex balance, and geography of the institutions).
    6. Page 3: “Labels used by models”. Please provide a few references on the rationale behind the pooling of the grades, i.e. why ACL and MCC injuries were pooled in a certain way.
    7. Page 3: “1398 studies”. Please rephrase so as not to start the paragraph with a number.
    8. Page 3: “detected using a deep reinforcement learning model”. It would be great to also briefly state the detection performance in the text.
    9. Page 6: “only the “best” candidate was selected”. This may read as if only one defect per knee was considered. Please elaborate here on what “candidate” means.
    10. Page 7: “Stage I training was designed to achieve high sensitivity, since false positive studies would be filtered by stage II. Indeed, in three tasks we observed sensitivity exceeding 95% (Table 2). However, in the Cartilage Edema task we obtained 89%.”. There is actually not a single task where sensitivity is >=95%. For MCC, 89% appears only in AUC, not sensitivity. Please review and fix this paragraph.
    11. Page 7: “4.9 mm localization accuracy”. Please provide the credible intervals, if available.
    12. Page 7: “Two models only used positional-labels in training (Labels = Posit. in Table 2).”. Please elaborate on how the classifier is trained in this case. These two experiments are rather difficult to understand.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
      • Method for grading knee lesions from multi-view multi-protocol MRI.
      • Strong clinically-relevant evaluation of the proposed method.
      • Interesting results showing the value of (I) combining positional and categorical annotations, (II) multi-institutional data for model robustness.
      • Limited methodological novelty of the work (not end-to-end, not multi-label).
      • Limited analysis w.r.t. prior art, both methodological (deep learning) and clinical (lesions in knee MRI).
      • Excellent quality paper, with very high reproducibility score.
  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    Interesting study exploring the use of both text-level global labels and pixel-level annotations for anomaly detection on multiple tasks. The study is conducted on a large dataset, making the ablation results comprehensive. The validation of the model with a public dataset is notable.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength is the ablation study of the various training methods and the comparison with inter-reader agreement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The strategy and rationale for using the peak-finding algorithm in Stage I are unclear and need to be explained better in the main manuscript. Also, the methodology for obtaining categorical or global-level labels is unclear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors satisfactorily illustrate the model development methodology, the various training strategies, the data used, how positional labels were obtained, and the comparison with inter-reader agreement between MSK radiologists.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. While the authors demonstrate a clever strategy in using positional labels to improve the accuracy of categorical labels, the assumption is that there will be real-world gains from a decreased need for pixel-level annotation when building robust models. The paper falls short in exploring this practical aspect (e.g., time taken to use mixed-format labels or the reduction in computational resources).
    2. How are the categorical labels obtained? Please expand on this crucial aspect. Are they labelled prospectively by the MSK radiologists, or are they retrospectively obtained from radiology reports? If the latter, how are the labels for the 4 tasks obtained, e.g., via NLP-based extraction of keywords?
    3. How was resizing done in Stage II if there is more than one lesion at different locations in the same study?
    4. The application of the peak-finding algorithm in Stage I is unclear. Is the best candidate the central point on a given slice, with the shortest distance to the lesion of interest?
    5. One of the claimed main contributions of the paper is that this is the first manuscript to detect subchondral edema. There are already multiple state-of-the-art papers that quantify, detect, and classify subchondral edema in multiple knee compartments, not just the MCC. Please rectify this claim.
    6. Finally, the biggest focus of the paper is that positional or pixel-level labels augment categorical labels by helping the model localize the lesion. Please provide some visual proof of how the model attention performs (e.g., saliency maps) if possible.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting paper that tests a novel strategy of using pixel-level labels to improve the locational accuracy of text labels. The hope is that such models can reduce the need for time- and labor-intensive pixel-level annotations when building ML models. The authors do a good job of providing statistical metrics on the various ablation methods but unfortunately do not provide any real-world statistics, such as reduced annotation time without a reduction in model performance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The approach integrates text-level global labels and pixel-level annotations for anomaly detection on MRI, classifying bone-specific pathologies. Reviewer feedback was mixed. I would appreciate it if you could please address questions relating to (1) prior work in object detection, (2) reproducibility of approaches, and (3) limited novelty.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We thank the reviewers for their constructive feedback. In the revised manuscript, we conducted additional experiments which improved reproducibility, added references to previous work, explained topics that required clarification, and addressed the reviewers’ comments point-by-point.

(1) Prior work in object detection. Reviewers 2 and 3 mentioned previous works on knee-related object detection that required additional references. Accordingly, citations to 4 papers on knee ACL and cartilage were added, including Namiri 2020 and Astuto 2021. To point out the difference between our paper and previous works, a clearer distinction between the data types was made in the revised paper. For example, many previous works combined ANATOMICAL landmarks with DEFECT labels in training. However, our present work combines two different DEFECT labels to improve attention and accuracy, which is useful when combining ground-truth labels from various sources. Specifically, we combine DEFECT POINT-like landmarks with categorical DEFECT labels, which was not covered before. This is now highlighted in the introduction and expands the novelty of our work. The one-stage SSD approach, suggested by reviewer 1, was tried but abandoned at a preliminary stage due to unpromising results when training with mostly negative samples. The two-stage approach handles this difficulty, which justifies our motivation and is now mentioned in the paper.

(2) Reproducibility of approaches. Two reviewers explicitly mentioned the high reproducibility of our study. However, reviewer 1 requested more statistical verification “before this work is published”, specifically to “compute a standard error over runs, and preferably execute a statistical test itself.” We took considerable measures, and these requirements are now met: five randomly initialized networks were trained for each of the 24 models appearing in Table 2; averages and standard deviations were calculated and added to Table 2; and a statistical test (McNemar’s) was performed as before. The Stage I results already stated the averages and standard deviations.

(3) Novelty. The paper is submitted to the application track, which emphasizes “clinical value”, “performance evaluations on large datasets”, “reproducibility”, and “clinical relevance”. These attributes were specifically praised in our reviews. The paper demonstrates novelty of application, data scope, and data handling. The misunderstanding regarding the novelty of application, which relates to the combination of DEFECT POINT-like landmarks (as opposed to defect and anatomical landmarks) with categorical DEFECT labels, was clarified under “Prior work” in this rebuttal, as well as in the revised manuscript.
The paper covers 4 knee abnormalities. ACL injury age was not covered by previous AI papers (Astuto et al 2021, mentioned by reviewer 2, classifies ACL integrity, not injury age). Another abnormality, cartilage edema, used a clinically different definition in previous studies: Astuto et al. detected “bone marrow edema”, which can be caused by traumatic and atraumatic pathologies. In contrast, the edema labeled in our dataset is limited to osteoarthritis-associated edema underlying a high-grade defect, which is a good predictor of structural deterioration in knee osteoarthritis [Bone Marrow Edema and Its Relation to Progression of Knee Osteoarthritis, Felson 2003]. All statements regarding the novelty of the data were corrected to make these distinctions clear. To the best of our knowledge, similar large-scale studies do not exist for Medial Compartment Cartilage and ACL. For comparison, Astuto included 86 ACL full-tear cases from 210 patients; our study covers over 4000 patients with 965 full tears and detects injury age, which was not done by Namiri or Astuto. In addition, we provide inter-reader agreement ratios for MSK-fellowship radiologists. Put together, our results set new standards for the scale and scope of knee MRI pathology detection studies.
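As a concrete illustration of the statistical verification the rebuttal describes in point (2) — mean and standard deviation over five randomly seeded runs, plus McNemar’s test between paired classifiers on the same test set — here is a minimal sketch using NumPy and statsmodels. The numbers and variable names are hypothetical placeholders, not results or code from the paper.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-run sensitivities for one task (five random seeds).
run_sensitivities = np.array([0.91, 0.93, 0.90, 0.92, 0.91])
print(f"sensitivity: {run_sensitivities.mean():.3f} "
      f"+/- {run_sensitivities.std(ddof=1):.3f}")

def mcnemar_pvalue(y_true: np.ndarray,
                   pred_a: np.ndarray,
                   pred_b: np.ndarray) -> float:
    """Compare two classifiers evaluated on the same cases.

    McNemar's test builds a 2x2 table of agreement in correctness;
    only the discordant cells (one model right, the other wrong)
    drive the statistic.
    """
    correct_a = (pred_a == y_true)
    correct_b = (pred_b == y_true)
    table = [
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ]
    # exact=True uses the binomial distribution on the discordant pairs,
    # appropriate when their count is small.
    return mcnemar(table, exact=True).pvalue
```

A low p-value would indicate that the two models (e.g., the proposed method and the categorical-only baseline) disagree in correctness more asymmetrically than chance would allow.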




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors satisfactorily addressed the key concerns raised by the reviewers. Hence, my recommendation would be to accept this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal sufficiently addresses the limited novelty and reproducibility issues. The method was also validated on a large public dataset.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the technical novelty of the paper is limited, the paper presents a novel application of existing deep learning-based techniques to solve important clinical problems and a comprehensive evaluation of the overall method. The latter point was clearly explained in the rebuttal to avoid any misunderstanding.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


