
Authors

Adrian Galdran

Abstract

Ordinal classification models assign higher penalties to predictions further away from the true class. As a result, they are appropriate for relevant diagnostic tasks like disease progression prediction or medical image grading. The consensus for assessing their categorical predictions dictates the use of distance-sensitive metrics like the Quadratic-Weighted Kappa score or the Expected Cost. However, there has been little discussion regarding how to measure the performance of probabilistic predictions for ordinal classifiers. In conventional classification, common measures for probabilistic predictions are Proper Scoring Rules (PSRs) like the Brier score, or Calibration Errors like the ECE, yet these are not optimal choices for ordinal classification. A PSR named the Ranked Probability Score (RPS), widely popular in the forecasting field, is more suitable for this task, but it has received no attention in the image analysis community. This paper advocates the use of the RPS for image grading tasks. In addition, we demonstrate a failure mode of this score resulting in counter-intuitive behavior, and propose a simple fix for it. Comprehensive experiments on four large-scale biomedical image grading problems over three different datasets show that the RPS is a more suitable performance metric for probabilistic ordinal predictions. Code to reproduce our experiments can be found at github.com/witheld.
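
For readers unfamiliar with the RPS, the following is a minimal sketch of its conventional cumulative form (not the authors' code; the normalization by K-1 is an assumption, as conventions vary), together with a small comparison illustrating the distance sensitivity that the Brier score lacks:

    import numpy as np

    def rps(p, y):
        # Ranked Probability Score of one prediction: squared gaps between
        # the predicted CDF and the step-function CDF of the true class y.
        p = np.asarray(p, dtype=float)
        K = p.size
        cdf_p = np.cumsum(p)                       # predicted cumulative distribution
        cdf_y = (np.arange(K) >= y).astype(float)  # 0...0 1...1, stepping at y
        return np.sum((cdf_p - cdf_y) ** 2) / (K - 1)

    def brier(p, y):
        # Multi-class Brier score: squared L2 distance to the one-hot label.
        p = np.asarray(p, dtype=float)
        return np.sum((p - np.eye(p.size)[y]) ** 2)

    # True class is 2 (the middle of five ordered grades). Both predictions
    # are fully confident and fully wrong, but one misses by one grade and
    # the other by two. The RPS penalizes the farther miss more; the Brier
    # score cannot tell them apart.
    near_miss = [0, 1, 0, 0, 0]
    far_miss  = [1, 0, 0, 0, 0]
    print(rps(near_miss, 2), rps(far_miss, 2))      # 0.25 vs 0.5
    print(brier(near_miss, 2), brier(far_miss, 2))  # 2.0 vs 2.0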

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_35

SharedIt: https://rdcu.be/dnwBv

Link to the code repository

github.com/agaldran/prob_ord_metrics

Link to the dataset(s)

https://tmed.cs.tufts.edu/tmed_v2.htm

https://www.kaggle.com/c/diabetic-retinopathy-detection


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a measure of accuracy for ordinal regression, called RPS (and its variant sa-RPS), which has been introduced before in statistics but is not well known in medical imaging studies. The authors illustrate several desired properties of the measure (proper, local, well behaved under removal of outliers) and illustrate it on three examples of image-based disease staging where a disease severity score is produced by a neural network classifier.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors tackle an important problem of evaluating accuracy of machine learning methods where the output (i.e., probability of classes) might be somewhat removed from the goal of the population study (i.e., disease scoring). The paper is clearly written and the methods are explained well. The experimental evaluation is thorough.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I found it difficult to be excited by the paper. While the problem it addresses is important, the solution seems only very slightly different from what is done in practice. If one cares about continuous scoring, they would derive the expected value from the ordinal output and use something like MSE or MAE to evaluate its accuracy. The authors compare the proposed measure to a couple of other methods, but not to what is used in practice. I realize it will probably also (slightly) outperform the current practice, but how important is it to have the absolutely best measure if it’s only slightly better? The experimental results suggest that the proposed measure is only slightly better than the baselines chosen by the authors. The paper would be much stronger if the authors demonstrated a case where the current practice and the proposed measure disagreed significantly and showed that the proposed measure better captures the reality of the situation. Or if used in training of the classifiers, the measure leads to significantly better performance. Without such demonstration, this exercise is akin to proposing a new measure of volume overlap for evaluating segmentations. Sure, one could come up with properties that the new measure has, but does it help us develop better medical image computing methods?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Simple method that can be easily reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See above. The paper is fine as it is, but is unlikely to have impact on how medical image computing methods are developed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors provided relevant examples and corrected my misconception about the lack of comparison to continuous measures.



Review #3

  • Please describe the contribution of the paper

    Authors discuss the use of a metric for probabilistic ordinal classifiers in medical image analysis problems, the ranked probability score. This score is well known in other fields but not so much in medical image analysis. Furthermore, they claim a problem with the score and propose a modification that solves the claimed problem. They illustrate the score in 4 ordinal classification problems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Ordinal classification is indeed an important task, and a detailed analysis of metrics for this task, as well as of their value in medical image analysis, is interesting for discussion.
    2. I generally appreciate bringing concepts from other fields to medical image analysis. This leads to very interesting follow-up work most of the time.
    3. The paper is really well written. I congratulate the authors.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The presentation in Section 2.1 is often confusing as to what is a contribution and what is recited from previous work. Appropriate referencing and clearly claimed contributions are crucial.
    2. I do not share the belief that the problems claimed for the RPS are really problems. First, why is a penalty that grows quadratically with distance preferred? Second, why is a lower RPS for the right-hand side of (6) preferred? In terms of probabilities, the left-hand side is closer to the real distribution; furthermore, it is much less ambiguous. Hence, the modifications and their importance are questionable to me. They are not well grounded.
    3. The evaluation has two major issues:
       a. One cannot appreciate the differences in Table 1 without a statistical test or a reference value. The situation is even more dire for the differences between RPS and sa-RPS, which are very small in absolute terms; yet no statistical test, reference value, or normalization is provided to help appreciate these numerical differences.
       b. Using the QWK to evaluate the success of probabilistic metrics seems problematic. The QWK is not a probabilistic metric; it is a metric for hard predictions, which contradicts the goal here. To give an analogy, this is similar to comparing the qualities of the Brier score and the ECE using accuracy.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is surely reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. I strongly recommend that the authors clarify their contributions in the paper, especially in Section 2.1.
    2. The authors could clarify and better discuss why their modification is relevant / better for a probabilistic prediction.
    3. The authors could consider the implications of using the QWK for assessing probabilistic models; whether this is suitable or not requires further justification.
    4. Calibration is mostly studied by focusing on the link between the error rate and the prediction probability. I recommend that the authors consider a similar framework for ordinal classifiers to assess the RPS and introduce modifications.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experimental setup and justification of the proposed modifications reduce my enthusiasm for this paper. Please see weaknesses and additional comments.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper sets out an argument for the use of the Ranked Probability Score (RPS) in medical image analysis over other metrics such as the ECE or the Brier score. The paper specifies the limitation of the RPS when dealing with symmetrical predictions and suggests a squared-absolute modification of the RPS to overcome this limitation. To evaluate this method, the paper proposes an evaluation framework in which the metrics are used to indicate which samples to reject, and performance is then measured on the remaining data. The paper shows that its modification of the RPS (sa-RPS) outperforms other metrics when applied to various medical image grading tasks from multiple medical image domains.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and presents a good and clear case for using the RPS in medical image analysis tasks where grading is important. When proposing the use of the RPS, the paper not only acknowledges and explains the potential issues with symmetrical predictions but also presents a modification, squared-absolute RPS, to deal with this weakness. To evaluate this method, the paper proposes a unique framework for evaluating these evaluation metrics: using them as a reject option and measuring performance on the now reduced datasets. The experiments use a good range of medical image grading tasks from multiple medical imaging domains. RPS and sa-RPS showed significant improvement on two of the datasets tested and slight improvement on the other two, slightly easier, datasets. The future work stated in the conclusion shows a promising direction for the RPS and its application to medical tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could have gone further with the application of the RPS to medical image analysis tasks, for example applying it as a loss function to improve model performance compared to standard training. The experiments could also have been repeated multiple times with different model initializations and data splits; reporting the mean and standard deviation would have given better insight into the stability of the results.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper uses publicly available datasets, and all the code is available online in a GitHub repository linked in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    There could be more comment on how sa-RPS could be used as a loss function to train neural networks on ordinal classification tasks.

    The experiments could have been repeated multiple times, with each run using different model initializations and dataset splits. Showing the mean and standard deviation of the results would give an indication of the stability of (sa-)RPS across models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper sets out a strong argument for sa-RPS to be adopted into medical image tasks that use ordinal classifiers such as medical grading tasks. The argument made is clear and persuasive and the experiments across a number of medical imaging datasets show its clear benefits.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The rebuttal made many good points addressing the concerns highlighted by the reviewers, particularly by R3. I am happy with the paper and recommend that it be accepted.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes using the “ranked probability score”, for assessing probabilistic ordinal classifiers in medical imaging. While the score is well known in other fields, it is not really used in medical imaging.

    The reviewers are mostly positive about this paper, which is very well written, but some concerns remain, and the authors will need to address these during the rebuttal:

    • The reviewers are concerned about how big a difference this new score really makes. Can the authors give an illustration where the metric makes an important difference?
    • Reviewer #3 has concerns regarding the stated problems with existing metrics (q2); can the authors respond to this?
    • Reviewers 2 and 3 both have concerns regarding the solidity of the experimental validation. Could the authors comment on the robustness and statistical validity of their results?




Author Feedback

We appreciate the effort invested by the reviewers in studying our work, and the kind comments, e.g. R3: “paper is really well-written. I congratulate the authors”. We divide their main concerns into three blocks: 1) Motivation, 2) Justification, and 3) Robustness & statistical significance.

1) Motivation. R1 and R3 appear to be slightly underwhelmed by our work, R3: “Authors can clarify and discuss better why their modification is relevant/better”; R1: “I found it difficult to be excited by the paper”, “is unlikely to have impact”. On the contrary! We believe that using an adequate PSR for ordinal classification changes the game of validating/diagnosing these systems on medical data. The obvious application (mentioned by R4) is that we can use it for training; due to lack of space, this is part of follow-up work. But the RPS also enables fine-grained, rigorous error analysis. Since PSRs assess samples individually, we can sort a test set using (SA-)RPS, NLL, and the Brier score: the worst-scored items are what each metric considers the most severely wrong probabilistic predictions.

  • Example: Consider p = vector of probabilities, y = label. On the same Eyepacs test set predictions, both RPS and SA-RPS find the worst test case to be y=0, p~[0.003, 0.002, 0.002, 0.003, 0.99], whereas the NLL finds y=0, p=[0, 0, 0.89, 0.098, 0.012] and the Brier score y=2, p=[0.003, 0.003, 0.003, 0.001, 0.99]. This clearly shows that (SA-)RPS is better suited for ordinal classification (a runnable version of this comparison is sketched after this list). We have included this example in our evaluation, in order to address R1’s concern: “The paper would be much stronger if the authors demonstrated a case where current practice and the proposed measure disagree significantly and show the proposed measure better captures the reality”.

  • Remark: R1 found it disappointing that we did not compare to just using MSE or MAE. But we did! The multi-class Brier score, which we included, is exactly that kind of metric: it is the squared L2 distance ||p-y||_2^2 between the probability vector and the one-hot label.
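
A runnable sketch of the Example and Remark above (our own illustration, not the paper's evaluation code; NLL and Brier are given their standard definitions, and the quoted probabilities are rounded, so the numbers are approximate):

    import numpy as np

    def rps(p, y):
        p = np.asarray(p, dtype=float)
        K = p.size
        return np.sum((np.cumsum(p) - (np.arange(K) >= y)) ** 2) / (K - 1)

    def nll(p, y):
        # Negative log-likelihood; a small epsilon guards against log(0).
        return -np.log(np.asarray(p, dtype=float)[y] + 1e-12)

    def brier(p, y):
        # Multi-class Brier score: squared L2 distance ||p - y||_2^2.
        p = np.asarray(p, dtype=float)
        return np.sum((p - np.eye(p.size)[y]) ** 2)

    # The three "worst cases" quoted in the Example above.
    cases = {
        "RPS/SA-RPS pick": ([0.003, 0.002, 0.002, 0.003, 0.99], 0),
        "NLL pick":        ([0.0, 0.0, 0.89, 0.098, 0.012], 0),
        "Brier pick":      ([0.003, 0.003, 0.003, 0.001, 0.99], 2),
    }
    for name, (p, y) in cases.items():
        print(f"{name}: RPS={rps(p, y):.3f} NLL={nll(p, y):.2f} Brier={brier(p, y):.3f}")
    # Only the RPS ranks the ordinally most distant error (y=0 predicted as
    # grade 4 with 99% confidence) as the worst of the three cases.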

2) Justification. R3 asks two relevant questions: A) why is a quadratic penalty with distance preferred? B) why is SA-RPS preferred over RPS? To answer A), recall that in medical ordinal classification (disease severity), we tend to prefer a quadratic penalty, so that more distant misdiagnoses are penalized increasingly more; this is why the QWK is preferred over the linearly-weighted kappa. Regarding B), we show that the RPS favors symmetric predictions, and argue that this may not always be meaningful. An image might be closer to class K “from the left”, yet still be annotated as K, and we should not reward our model for predicting the image to be equally likely K+1 and K-1. Or maybe we should, but without prior information this might not be justified. However, R3 is right: ours is a heavy claim, and we have found opposing views in the statistics literature. In light of this, we have relaxed our statements: preferring symmetry is no longer described as a pathology of the RPS but as a debatable property, so we give the practitioner the chance to choose between RPS and SA-RPS. Thanks for this.
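
To make the symmetry discussion concrete, here is a small sketch (our reading: the sa_rps below squares the sum of absolute CDF gaps, following the reviews' description of a "squared absolute" fix; the paper's exact formula and normalization may differ):

    import numpy as np

    def rps(p, y):
        p = np.asarray(p, dtype=float)
        K = p.size
        return np.sum((np.cumsum(p) - (np.arange(K) >= y)) ** 2) / (K - 1)

    def sa_rps(p, y):
        # "Squared absolute" variant: square the summed absolute CDF gaps,
        # instead of summing squared gaps.
        p = np.asarray(p, dtype=float)
        K = p.size
        return np.sum(np.abs(np.cumsum(p) - (np.arange(K) >= y))) ** 2 / (K - 1)

    y = 2                            # true class K, middle of five grades
    one_sided = [0, 1.0, 0, 0.0, 0]  # all mass on K-1
    symmetric = [0, 0.5, 0, 0.5, 0]  # mass split between K-1 and K+1
    print(rps(one_sided, y), rps(symmetric, y))        # 0.25 vs 0.125: RPS rewards the symmetric split
    print(sa_rps(one_sided, y), sa_rps(symmetric, y))  # 0.25 vs 0.25: sa-RPS scores them equally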

3) Robustness & Statistical significance. R3 and R4 were concerned about this. Bootstrapped confidence intervals were already provided in the appendix. We now also statistically test the significance of performance differences. Almost all tests are significant, except RPS vs SA-RPS on one of the four datasets, LIMUC: a very high QWK = 90.70 results in both scores finding the obvious mistakes first. To make room for confidence intervals in the main paper, and to allocate space for the Example in 1) above, we now keep the two more interesting cases of retinal image and cardiac ultrasound classification, and move the other two to the appendix. This simplifies the exposition, allowing us to incorporate and discuss confidence intervals and statistical significance testing, as requested by R3.
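
For reference, a minimal sketch of the kind of paired bootstrap comparison described above (hypothetical per-sample score arrays; not the authors' evaluation code):

    import numpy as np

    def paired_bootstrap_ci(scores_a, scores_b, n_boot=10000, seed=0):
        # 95% bootstrap CI for the mean difference between two per-sample
        # score arrays, resampling the same test indices for both (paired).
        rng = np.random.default_rng(seed)
        a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
        n = a.size
        diffs = np.empty(n_boot)
        for i in range(n_boot):
            idx = rng.integers(0, n, size=n)
            diffs[i] = a[idx].mean() - b[idx].mean()
        return np.percentile(diffs, [2.5, 97.5])

    # Hypothetical per-sample scores for two metrics/models on one test set;
    # if the interval excludes 0, the difference is significant at ~5%.
    rng = np.random.default_rng(1)
    model_a = rng.beta(2, 8, size=500)  # toy per-sample scores in [0, 1]
    model_b = model_a + rng.normal(0.02, 0.05, size=500)
    print(paired_bootstrap_ci(model_a, model_b))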

Other) R2: “authors should clarify contributions in the paper. Especially in Section 2.1.” We have clearly referenced every concept from previous work.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes using the “ranked probability score”, for assessing probabilistic ordinal classifiers in medical imaging. While the score is well known in other fields, it is not really used in medical imaging. The reviewers were mostly positive about this paper, which is very well written. They did, however, also have concerns:

    • The reviewers were concerned about how big a difference this new score really makes. In their rebuttal, the authors gave good examples of how the score can have clinical utility.
    • There were concerns regarding whether the stated problems with the state of the art are real; the authors gave explicit examples of why their argument holds.
    • Finally, there were concerns regarding the robustness and statistical validity of the experimental results, which were also addressed in the rebuttal.

    As the authors did a good job of addressing the reviewers’ concerns, I am happy to recommend acceptance of this paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The measure of accuracy for ordinal regression has been introduced before in statistics but is not well known in medical imaging studies. Its use may be of only marginal value.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers agree that the rebuttal is good and lean toward accepting the paper.


