
Authors

Yizhe Zhang, Suraj Mishra, Peixian Liang, Hao Zheng, Danny Z. Chen

Abstract

We aim to quantitatively measure the practical usability of trained medical image segmentation models: To what extent and on which samples a model’s predictions can be used/trusted. We first propose a scheme, Correctness-Confidence Rank Correlation (CCRC), to measure how predictions’ confidence estimates correlate with their correctness scores in rank. A model with a high value of CCRC means its prediction confidences reliably suggest which samples’ predictions are more likely to be correct. But, since CCRC does not capture the actual prediction correctness, it alone is insufficient to indicate whether a prediction model is both accurate and reliable to use in practice. Thus, we further propose another method, Usable Region Estimate (URE), which simultaneously quantifies predictions’ correctness and reliability of confidence assessments in one estimate. URE provides concrete information on to what extent a model’s predictions are usable. In addition, the sizes of usable regions (UR) can be utilized to compare models: A model with a larger UR can be taken as a more usable and hence better model. Experiments on six datasets validate that our proposed evaluation methods perform well, providing a concrete and concise measure for the practical usability of trained medical image segmentation models.
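
For orientation, here is a minimal, hypothetical sketch (not the authors' released code; the repository linked below contains the actual implementation) of how CCRC and a simplified usable-region fraction could be computed, assuming per-sample confidence estimates and correctness scores (e.g., Dice) and using Spearman's rank correlation for CCRC. The full URE additionally uses a bootstrap/percentile guarantee, which this sketch omits.

```python
# Illustrative sketch only: CCRC as a rank correlation between confidence and
# correctness, plus a simplified usable-region fraction (no bootstrap guarantee).
import numpy as np
from scipy.stats import spearmanr

def ccrc(confidences, correctness):
    """Rank correlation between per-sample confidence estimates and correctness scores."""
    rho, _ = spearmanr(confidences, correctness)
    return rho

def usable_region_fraction(confidences, correctness, requirement):
    """Largest fraction of the most-confident samples whose average
    correctness (e.g., Dice) still meets the requirement."""
    order = np.argsort(-np.asarray(confidences))              # most confident first
    sorted_corr = np.asarray(correctness, dtype=float)[order]
    running_mean = np.cumsum(sorted_corr) / np.arange(1, len(sorted_corr) + 1)
    meets = np.nonzero(running_mean >= requirement)[0]
    return (meets.max() + 1) / len(sorted_corr) if meets.size else 0.0
```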

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_17

SharedIt: https://rdcu.be/cVRyv

Link to the code repository

https://github.com/yizhezhang2000/ure

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    The authors note that there is often a discordance between accuracy and usability in machine learning models. They propose two metrics to address this issue. First, Correctness-Confidence Rank Correlation measures the association between prediction correctness and confidence. Second, Usable Region Estimation measures the proportion of the population for which the model is clinically usable.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is clearly organized and easy to follow. The problem of model usability is a major concern in medical machine learning that is not adequately addressed in current literature. The idea of combining usability with accuracy in a single metric can potentially be highly useful in real implementation of machine learning models in healthcare.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors mention the main strength of using their metric over assessing accuracy and usability independently is that they are combined in a single metric. However, the combined metric is dependent on several tunable parameters (i.e., the URE metric depends on the choice of a threshold and choice of statistical metric). Since the ranking of models may depend on the choices of these parameters, it may not be clear which is actually superior for a task.

    The evaluation shows the relative performance of several models on different public datasets. This evaluation only shows that the ranking of these models changes based on the metric used, but does not demonstrate that the proposed metric’s ranking is superior. An additional usability study with clinician feedback would be helpful here to show that the proposed metric is more aligned with how users evaluate model output.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Datasets and models are all publicly available, so it should be easy to replicate.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In addition to the main weaknesses mentioned above, I have a couple of additional comments about the evaluation. First, there are no error bounds or statistical tests showing the significance of differences between the models. Second, it is unclear why different models were used for the sixth dataset evaluation. Third, the bar graphs in Figure 3 can be hard to read; I would suggest line graphs to better show the association between the values for different correctness requirements within each model.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed metrics have potential and it is an important problem. However, the evaluation does not demonstrate the superiority of the proposed metrics over existing work. The results are also dependent upon parameter selection, which can make the results unstable and hard to interpret.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #5

  • Please describe the contribution of the paper

    The paper proposes a concrete and unified measure, called Usable Region Estimate (URE), that quantifies both the correctness and the reliability of a model, which is more clinically usable and helpful for comparing and selecting models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The paper is generally well-written and easy to follow; (2) the paper successfully identifies the limitations of the current model evaluation and selection pipeline and proposes a novel quantitative measure that is more concrete and practicable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The advantage of CCRC over ECE is not clearly justified; (2) the limitations of URE are not identified and discussed.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficient implementation details are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    (1) In the Introduction, the paper claims that CCRC is more practical because a high CCRC indicates better alignment between correctness and confidence; I believe this also applies to ECE. If so, what is the advantage of CCRC over ECE? (2) The computation of URE needs a pre-defined correctness constraint, which limits the flexibility of model evaluation; in fact, defining a proper correctness constraint may be hard in many applications. It would be good to see the authors' thoughts on this problem. (3) In Figure 3, we can see that a small change in the correctness requirement can lead to a dramatic change in UR; e.g., in Fig. 3-4, a change of 0.01 Dice can lead to a 0.15 change in UR for U-Net. Does this mean URE is very sensitive to the correctness requirement, or in other words, not stable? Changing the requirement by very little may lead to a different model recommendation. (4) How is the conclusion (URE is stable on unseen samples) drawn from Table 1? It is hard to judge whether a ~10% violation frequency is good or bad based on the current content.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel evaluation metric that quantifies model correctness and confidence in a unified way. The paper is generally well written. Further discussion and justification of the limitations and the experimental evaluation may help improve the paper.

  • Number of papers in your stack

    8

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #6

  • Please describe the contribution of the paper

    The paper proposes two new indices: CCRC, which measures the correlation between correctness scores and confidence estimates, and URE, which quantifies prediction correctness and the reliability of confidence estimates. The authors tested the two indices on six different clinical applications.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Extensive experiments on different clinical applications
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No comparison with other model evaluation methods
    • No justification of why these new indices should perform better than existing ones
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors should justify why these two indices should be used instead of existing ones. They should add a comparison and show cases where the proposed indices are more representative of model performance.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is not clear why these indices should give more information about model performance than existing indices.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    A measure to quantify both the correctness and the reliability of a model is proposed. The topic is of interest. An evaluation on several different datasets is presented. The dependency on tunable parameters seems to be a major weakness, and the reviewers were not convinced that the proposed measure is better than existing ones.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8




Author Feedback

We thank all reviewers for their valuable feedback. Below, we address their concerns.

On tunable parameters: Many existing metrics have tunable parameters. For example, the IoU threshold is a tunable parameter in mAP (for object detection); pixel distance is a tunable parameter for contour detection and for the recently proposed Boundary IoU (CVPR 2021). Whether or not URE is used, setting a Clinically Usable Segmentation Accuracy (CUSA) is crucial for developing medical AI models (for determining whether a model is good enough). CUSA is task-dependent and is expected to be determined by medical experts and medical AI practitioners. Moreover, practitioners can use a range of accuracy levels to generate UR diagrams (see Fig. 3) for model evaluation. A model may perform better on one portion of samples but worse on another, and UR diagrams capture such phenomena when comparing models. Mixed results in UR diagrams can be read as a sign that one model's superiority over the others is unstable (unconvincing). Today, segmentation accuracy improvements in the average correctness score are often small, and a model with a better average score may still perform worse on a significant portion of test samples. Using the UR diagram, a model is considered superior only when it consistently gives higher numbers across a range of accuracy levels. This helps avoid premature decisions about which model is superior and helps the medical AI field develop better models in a more rigorous way.
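
As a rough illustration of the UR-diagram idea described above (a sketch under assumed inputs and a simplified UR definition, not the paper's implementation), one could sweep a range of correctness requirements and record the usable-region fraction for each model:

```python
# Illustrative only: sweep correctness requirements and report the UR fraction
# per requirement, in the spirit of the UR diagrams in Fig. 3.
import numpy as np

def ur_curve(confidences, correctness, requirements):
    """UR fraction at each requirement: largest fraction of the most-confident
    samples whose average correctness still meets the requirement."""
    order = np.argsort(-np.asarray(confidences))            # most confident first
    corr = np.asarray(correctness, dtype=float)[order]
    running_mean = np.cumsum(corr) / np.arange(1, len(corr) + 1)
    curve = []
    for req in requirements:
        meets = np.nonzero(running_mean >= req)[0]
        curve.append((meets.max() + 1) / len(corr) if meets.size else 0.0)
    return curve

# Hypothetical usage: compare two models across Dice requirements 0.80-0.95.
requirements = np.round(np.arange(0.80, 0.96, 0.01), 2)
# ur_a = ur_curve(conf_model_a, dice_model_a, requirements)
# ur_b = ur_curve(conf_model_b, dice_model_b, requirements)
# Prefer model A only if its curve dominates model B's at every requirement.
```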

Comparison to existing metrics: URE is the first to unify a correctness metric (e.g., F1 score) and a confidence metric (e.g., ECE). It works on top of existing metrics and is superior in that existing metrics cannot provide a unified view of segmentation correctness and confidence-estimation quality in one measure. In the experiments, we compared URE with these two types of metrics (e.g., F1 score/Dice and ECE) and showed that URE provides more insight into segmentation model performance (please also refer to the UR-diagram discussion above).

Stability across accuracy levels: For a test set containing fewer samples, large changes in URs may occur across different accuracy levels. A larger test set leads to smoother UR changes, as evidenced by the UR diagram of BUSI in Fig. 3. The smoothness of change also depends on the segmentation model itself: if there is a sharp drop in the correctness scores between two accuracy levels, the UR diagram will reflect it. Overall, for a stable segmentation model and a large enough test set (e.g., >1000 samples), the change of UR should be smooth across accuracy levels. This is guaranteed by the bootstrapping technique used in the URE algorithm and the central limit theorem. We will add theoretical justifications in the revision.

Stability on new samples: If samples are drawn i.i.d. and the number of test samples is large enough, the estimated UR should be stable for new samples. We consider a 10% violation frequency (VF) as good, and this VF could be lower when more samples are available for estimating the UR. The VF should approach the percentile number used in URE (line 11 in Listing 1.1) as more samples are used.
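
The bootstrap/percentile mechanism and violation-frequency check referred to here might look roughly like the following sketch; the 5th-percentile default, the function names, and the batch-resampling reading of VF are assumptions on our part, not the paper's Listing 1.1.

```python
# Rough sketch, not the paper's Listing 1.1: a bootstrap/percentile acceptance
# test for a confidence-ranked prefix, and a violation-frequency check on
# unseen samples that fall inside the estimated usable region.
import numpy as np

rng = np.random.default_rng(0)

def prefix_is_usable(correctness_prefix, requirement, n_boot=1000, percentile=5):
    """Accept a confidence-ranked prefix only if a low percentile of its
    bootstrapped mean correctness still meets the requirement."""
    prefix = np.asarray(correctness_prefix, dtype=float)
    boot_means = [rng.choice(prefix, size=prefix.size, replace=True).mean()
                  for _ in range(n_boot)]
    return np.percentile(boot_means, percentile) >= requirement

def violation_frequency(unseen_correctness_in_ur, requirement,
                        n_trials=1000, batch_size=None):
    """Over resampled batches of unseen samples inside the estimated usable
    region, how often does their mean correctness miss the requirement?
    With enough data this frequency should roughly track the percentile above."""
    scores = np.asarray(unseen_correctness_in_ur, dtype=float)
    batch_size = batch_size or scores.size
    violations = sum(
        rng.choice(scores, size=batch_size, replace=True).mean() < requirement
        for _ in range(n_trials))
    return violations / n_trials
```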

Choosing statistics: When summarizing segmentation correctness for a pool of samples, the average is used for summarizing F1 scores and ECEs, and the max/average is used for summarizing Hausdorff distances. By default, we use the average in URE but provide other options (e.g., percentile). We consider having options an advantage, not a disadvantage.

CCRC vs. ECE: ECE measures the absolute difference between confidence and correctness scores; CCRC measures their rank correlation. High rank correlation is a key factor leading to a larger UR, but a smaller value in ECE may not necessarily lead to a larger UR.
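
A tiny synthetic example (numbers invented for illustration, not taken from the paper) shows how a small average confidence-correctness gap can coexist with poor ranking, and vice versa:

```python
# Synthetic illustration: confidences can be close to correctness in value
# (small ECE-like gap) yet poorly ranked, and vice versa.
import numpy as np
from scipy.stats import spearmanr

correctness = np.array([0.70, 0.75, 0.80, 0.85, 0.90])

# Case A: values close to correctness but in scrambled order
# -> small absolute gap (0.09), weak/negative rank correlation (-0.10).
conf_a = np.array([0.84, 0.71, 0.89, 0.76, 0.81])

# Case B: values offset from correctness but perfectly ordered
# -> larger absolute gap (0.20), perfect rank correlation (1.00).
conf_b = np.array([0.50, 0.55, 0.60, 0.65, 0.70])

for name, conf in [("A", conf_a), ("B", conf_b)]:
    gap = np.mean(np.abs(conf - correctness))   # crude ECE-like gap, not binned ECE
    rho, _ = spearmanr(conf, correctness)
    print(f"case {name}: mean |conf - correctness| = {gap:.3f}, rank corr = {rho:.2f}")
```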

Feedback from medical experts: We consulted several medical experts for feedback on URE, and the feedback was quite positive. URE showed its usefulness in model selection for a joint project on ultrasound video segmentation.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Especially in our deep learning era, the medical image analysis community relies overly on incrementally improving correctness scores such as F1 or Dice. This work proposes a shift to a more clinically relevant metric that takes both correctness and confidence in predictions into account. I find this a very refreshing and original take on medical image segmentation.

    Reviewers posed a number of questions regarding details of the proposed scores, especially the URE score. While these are relevant questions, I am convinced that (1) the authors addressed the raised concerns adequately, (2) a minor revision will clarify some issues, such as the dependency on some parameters, and (3) the merit of the proposed methodology is high; i.e., while it may not be what the community finally agrees upon as a more clinically relevant metric, it is a step in that direction.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a concrete and unified measure, called Usable Region Estimate (URE), that quantifies both the correctness and the reliability of a model, which is more clinically usable and helpful for comparing and selecting models.

    Rebuttal seems to address major concerns of reviewers.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9


