
Authors

Charles Lu, Anastasios N. Angelopoulos, Stuart Pomerantz

Abstract

The regulatory approval and broad clinical deployment of medical AI have been hampered by the perception that deep learning models fail in unpredictable and possibly catastrophic ways. A lack of statistically rigorous uncertainty quantification is a significant factor undermining trust in AI results. Recent developments in distribution-free uncertainty quantification present practical solutions for these issues by providing reliability guarantees for black-box models on arbitrary data distributions as formally valid finite-sample prediction intervals. Our work applies these new uncertainty quantification methods — specifically conformal prediction — to a deep-learning model for grading the severity of spinal stenosis in lumbar spine MRI. We demonstrate a technique for forming ordinal prediction sets that are guaranteed to contain the correct stenosis severity with a user-defined probability (confidence level). On a dataset of 409 MRI exams processed by the deep-learning model, the conformal method provides tight coverage with small prediction set sizes. Furthermore, we explore the potential clinical applicability of flagging cases with high uncertainty predictions (large prediction sets) by quantifying an increase in the prevalence of significant imaging abnormalities (e.g. motion artifacts, metallic artifacts, and tumors) that could degrade confidence in predictive performance when compared to a random sample of cases.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_52

SharedIt: https://rdcu.be/cVVp8

Link to the code repository

https://github.com/clu5/lumbar-conformal

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a distribution-free method of estimating uncertainty of ordinal predictions, with a simple approximate algorithm. An experiment is carried out on a stenosis grading task from MRI. The result is high-performing in comparison to the competing method, and the uncertainty score is able to select problematic examples as revealed by manual radiology review.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clinician trust in AI prediction is an important topic. Mathematical guarantee such as presented could have significant impact on adoption of AI technologies.
    • The algorithm proposed is simple and relatively intuitive.
    • The paper is well-written and easy to follow. The symbols are mostly clearly defined, and the explanation is intuitive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The performance of the proposed method seems largely on par with LAC in the evaluation presented. It is unclear what the benefit of the current method is. Does it suffice to simply apply LAC to the current problem to get uncertainty estimates?
    • While the authors have demonstrated that high uncertainty cases tend to contain problems, is this true at the other end of the spectrum as well? That is, do low uncertainty examples also contain fewer problems?
    • The proofs of the theorem and proposition are left out completely from the main text. Some high-level idea should at least be included for the reader to follow along. In the supplement itself, the proofs are also largely left to the cited article. It would be helpful to reproduce those proofs so that they are easier to understand for readers who may not be as familiar with this literature.

    Some follow up questions/comments:

    • Why do higher severity predictions have higher uncertainty scores? Does the problem become harder to recognize as the severity increases?
    • There is some stray text between Fig 4 and 5.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Very good. The algorithm is clearly described and code is promised to be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See my weakness section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this is well-written and a solid experimentation overall.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The paper proposes an ordinal prediction set method that is guaranteed to contain the reported severity with a chosen probability in the context of automatic disease severity rating. It also shows a method to quantify the uncertainty in this setting.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A novel ordinal prediction set algorithm that was shown to perform similarly to a non-ordinal algorithm in output set size and coverage.
    2. A method for quantifying uncertainty in this setting is provided.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The provided evaluation of uncertainty estimation is biased. The radiologist checked only the cases with high uncertainty for potential issues, but the cases with lower uncertainty might have issues as well.
    2. Clarity issues in the algorithm description and in a few other places.

    See details in the comments section.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors stated that they will make the code publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The evaluation of uncertainty estimation is biased. The radiologist checked only the cases with high uncertainty for potential issues, but the cases with lower uncertainty might have issues as well. Cases from the low uncertainty regime should be sampled as well and presented to the radiologist, who should not know whether the presented cases were flagged as “high uncertainty” or “low uncertainty” by the algorithm.
    2. The Ordinal APS algorithm, which is the proposed method, is only described in pseudo-code without additional explanation. It would be helpful to also describe the algorithm in paragraph form. The following was unclear in the pseudo-code:
      • Initialization: it looks like S is initially an empty set. It is therefore not clear how y’ is selected.
      • How does the pseudo-code make sure the same y’ is not selected twice?
    3. Table 1 in appendix:
      • Inconsistent “grading class” order. E.g. “Ordinal CDF” appears first in the “Coverage” row but last in the “Size” row. This makes interpretation of the results confusing.
      • The row label “Coverage” is visually too large and extends beyond the rows it describes.
      • It is not clear what the “count” row means in the stratification by set size.
      • The title mentions stratification by the true stenosis grading label, but it looks like there is also a separate stratification by set size.
    4. Figure 5:
      • Inconsistency in the chosen cases for radiologist’s inspection: there are 5 stars in each category in the plot, but from the description it looks like only 4 points were taken from each category.
    5. It is stated that “LAC provably has the smallest average set size but achieves this by entirely ignoring conditional coverage, and does not consider ordinal information”. However, results did not showcase any limitation of the LAC algorithm and it was shown to perform similarly to the proposed method.
    6. Minor grammar issues:
      • Page 4: the sentence “In practice, therefore approximate Ordinal APS…”
      • The last sentence on page 6 is cut off.
      • The first sentence of the “Conclusion” section: “into” should be removed.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors stated that the training and evaluation code will be made available.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The paper proposes a novel ordinal prediction set algorithm that was shown to perform similarly to a non-ordinal algorithm in output set size and coverage. It also provides a method for quantifying uncertainty in this setting. However, the method was tested only on a single dataset.



Review #6

  • Please describe the contribution of the paper

    Authors evaluate the usage of conformal prediction sets for uncertainty prediction in clinical settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation for the work is sound, uncertainty estimation is known to be a serious problem for deep learning methods deployed in clinical settings. Overall, the paper is well structured with no immediate flaws.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The motivation for the paper is that deep learning methods suffer from certain shortcomings and that radiologists have a hard time trusting these models when they fail in unexpected ways. The authors even cite a survey which shows that radiologists indeed have a hard time trusting such models. Given that this is the main motivation of the paper, I expected to see language that is more digestible by radiologists, whereas the language employed in the paper is not straightforward for a person who is not familiar with the underlying math.

    Herein lies a contradiction: a) if the target audience of the paper is computer scientists, the experimental results on a single dataset with a single model fail to show convincing results. The IID assumption can easily be challenged in the context of medical imaging, which opens long discussions and puts the correctness of Theorem~1 at stake. b) if the target audience is radiologists, the language employed is not appropriate.

    Given the evidence in the introduction, I believe the authors envisioned the paper to target radiologists rather than computer scientists. With this assumption, I suggest the authors be more explicit with the definitions, theorems, and propositions to allow digestion by a larger audience. This paper must be understood by clinicians who are familiar with the concept of deep learning but not so math-savvy. I understand that the number of pages is a limiting factor; my suggestions are to 1) move Algorithm~1 to the supplementary material and use that space to enhance the text (the algorithm pseudo-code also has limited value since the authors will share the code anyway), and 2) put the images in Fig~4 next to each other instead of in two rows.

    Also, the experimental details need enhancing: what model was used to obtain the scores? How was it trained? How was the data split? Many details are missing.

    Other than that, there are a number of typos in the paper:

    1. (Page 2) Set Y is defined to be within {0,…,K-1} but later used in \arg\max as y \in {1, …, K}; is this the same K? If so, pick one range; if not, use another letter.
    2. (Page 2) D_test = {–}^{n}: is it n or n_{test}, since the train set size is defined as n_{train}?
    3. (Page 4) “sequence of nested sets that includes Y Using …”: missing full stop?
    4. (Page 4) At the end of Proposition 2, “… satisfies 3”: 3 what? Equation (3)?
    5. (Page 6) “majority class (—-). observed in Figure 3”: full stop by mistake?
    6. (Page 7) Please put Fig~4 and Fig~5 right after each other with no text in between; reading a single line of text between those figures is awkward.
    7. (Page 8) A full stop is missing after \textbf{Uncertainty Quantification for Critical Applications}.

    The language employed makes or breaks the paper for me, since the current drag for widespread AI deployment is mainly due to the resistance from radiologists. As such, papers like this one which show convincing results on these points of stress need to be digestible by clinicians.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some parts are reproducible, but overall, many details about the model employed are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See weaknesses section for the detailed comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Mainly, inappropriate language and missing experimental details

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper addresses the challenge of providing guarantees associated with predictions of ranked outcomes. The authors introduce methods for generating ordinal prediction sets guaranteed to contain the reported severity with a user-chosen probability. The authors use formal mathematical guarantees that give clinicians explicit assurances about the algorithm’s performance. The authors make very wide claims about the guarantees: that they are distribution-free and work for any pre-trained model, any possibly unknown data distribution, and in finite samples. However, the authors corroborate and illustrate the validity of their approach using a single dataset. Also, the method of supporting the results through manual evaluation by radiologists is biased, as discussed by the reviewers.

    The approach presented in the paper is novel and the problem addressed by the authors is important. The reviewers have different opinions regarding the paper. Please see their comments and questions and provide your response.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    14




Author Feedback

Overall, reviewers agreed that our work is well-motivated, technically sound, and addresses an important challenge of deep learning for clinical application. We understood the main concerns to be 1) our evaluation in Section 3.2 is biased and 2) the language in Section 2 is too technical for a clinician audience.

  • “The evaluation in Section 3.2 is biased” [R2, R3] - We agree with the reviewers that the results of our experiment could be strengthened by also evaluating low uncertainty cases, in which we know the base rate of abnormalities to be quite small. This dataset was collected in order to develop a commercial AI tool, so it was already highly curated, with exclusion criteria to filter out obviously problematic cases. Consequently, our clinicians estimate the base rate of abnormalities to be less than 5%, much lower than the 42% found in the high uncertainty group. In the final draft, we intend to estimate the true rate of abnormalities by having the radiologist review a random sample of studies to quantitatively strengthen our results.

  • “The language in Section 2 is too technical for clinicians” [R6] - We agree and thank the reviewer for pointing this out. For the final draft, we will work with clinicians to simplify this section to be more accessible to a clinical audience. We take seriously the need to improve our presentation to better communicate our methods to those, like radiologists, who are currently unfamiliar with conformal prediction but may benefit from its integration into their workflow, and we will include a description of clinically realistic potential applications. We have moved Algorithm 1 to the supplement, added explanatory diagrams, and included more intuitive language, including revisions of the main theorems, such as the one below.

\textbf{Original} \begin{theorem}[Conformal coverage guarantee] Let $(X_1,Y_1)$, $(X_2,Y_2)$, …, $(X_n,Y_n)$ and $(\Xtest,\Ytest)$ be drawn independently and identically distributed from distribution $\P$, and let $\Tlam$ be any sequence of sets nested in $\lambda$ such that $\underset{\lambda \to \infty}{\lim} \Tlam = \Y$. Then, choosing $\lhat=\A(\Tlam, \; \alpha)$ implies $\Tlhat$ satisfies (3). \end{theorem}

\textbf{Revised:} \begin{theorem}[Conformal coverage guarantee] Let our calibration data be $X_1,…,X_n$, MRI images sampled from a population of patients, and $Y_1,…,Y_n$, their associated ground-truth severity interpretations. If a new patient with MRI $X_{\rm test}$ and severity $Y_{\rm test}$ is drawn from the same population, then the conformal prediction set, $\Tlhat$, will contain the true rating with probability $1-\alpha$, where $\alpha$ is the error rate chosen by the user, \begin{equation} \P\Big( Y_{\rm test} \in \Tlhat(X_{\rm test}) \Big) \geq 1-\alpha. \end{equation} \end{theorem}
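To make the revised guarantee concrete, the calibration step it relies on can be sketched in a few lines of Python. This is a hypothetical illustration using a generic LAC-style score (1 minus the model's probability for the true class) on synthetic data, not the paper's Ordinal APS implementation; the names `conformal_threshold` and `prediction_set` are ours.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

def prediction_set(probs, qhat):
    """Keep every class whose conformal score falls below the threshold."""
    return [k for k, p in enumerate(probs) if 1 - p <= qhat]

# Toy calibration data: scores are 1 - p_model(true class)
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)
cal_labels = rng.integers(0, 4, size=500)
cal_scores = 1 - cal_probs[np.arange(500), cal_labels]

qhat = conformal_threshold(cal_scores, alpha=0.1)
print(prediction_set([0.05, 0.15, 0.6, 0.2], qhat))
```

By exchangeability of the calibration scores with the test score, a test point's true label lands inside the resulting set with probability at least 1 - alpha, which is exactly the statement of the theorem above.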

  • “Regarding comparison to LAC” [R2] - LAC provably has the smallest set size but does not respect ordinality; e.g., LAC may predict both “no disease” and “most severe”, which will confuse the user. This scenario is impossible with Ordinal APS, yet we match the optimal set size of LAC.

  • “Comprehensive proofs.” [R2] - In the final draft, we will reproduce the referenced proofs in the supplements.

  • “Correlation between severity and uncertainty” [R2] - In general, the radiologists’ interpretations of severe cases have a higher variance, and the uncertainty reflects this. Additionally, the dataset contains fewer severe cases than mild or normal cases, which decreases the model’s confidence.

  • “Additional details about the Ordinal APS algorithm” [R3] - This was a typo: $\mathcal{T}_\lambda(x)$ should be initialized to $\arg\max \hat{f}(x)$. An element cannot be selected twice because $S \cap \mathcal{T}_\lambda(x) = \emptyset$ at each iteration.
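For readers following along, the initialization described above admits a short greedy sketch: start the set at the most likely grade and repeatedly extend it toward whichever neighboring grade carries more probability mass. This is a hypothetical illustration of the Ordinal APS idea, not the authors' exact implementation; the threshold `tau` stands in for the calibrated level.

```python
def ordinal_aps_set(probs, tau):
    """Greedy contiguous prediction set around the argmax (Ordinal APS sketch).

    Starts at the most likely grade and extends the interval one grade at a
    time toward the more probable neighbor, so no grade is ever added twice
    and the returned set is always contiguous (respecting ordinality).
    """
    k = len(probs)
    lo = hi = max(range(k), key=lambda i: probs[i])  # initialize at argmax
    mass = probs[lo]
    while mass < tau and (lo > 0 or hi < k - 1):
        left = probs[lo - 1] if lo > 0 else -1.0
        right = probs[hi + 1] if hi < k - 1 else -1.0
        if left >= right:
            lo -= 1
            mass += probs[lo]
        else:
            hi += 1
            mass += probs[hi]
    return list(range(lo, hi + 1))

print(ordinal_aps_set([0.05, 0.10, 0.60, 0.20, 0.05], tau=0.9))  # → [1, 2, 3]
```

Because the interval only ever grows outward from the argmax, a disconnected prediction such as {“no disease”, “most severe”} cannot occur, which is the ordinality property contrasted with LAC elsewhere in this rebuttal.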

  • “Additional details about the stenosis model” [R6] - The stenosis model was based on previous work; we will provide more comprehensive details on its development and training procedure in the supplements.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors nicely answered most of the reviewers’ concerns. This is a good paper that fits MICCAI. I ask the authors to make the revisions they discussed in the rebuttal. It will make the paper an excellent one.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed the concerns of the reviewers. One reviewer who recommended weak rejection has not responded to the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper suggests an ordinal prediction set algorithm for associating confidences with predictions in a distribution-free manner. While it is unclear to me why it is good to be distribution-free, the paper seems sound enough, the reviewers are largely pleased with the paper and the rebuttal addresses the reviewer concerns well.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9


