
Authors

Till J. Bungert, Levin Kobelke, Paul F. Jäger

Abstract

To ensure the reliable use of classification systems in medical applications, it is crucial to prevent silent failures. This can be achieved by either designing classifiers that are robust enough to avoid failures in the first place, or by detecting remaining failures using confidence scoring functions (CSFs). A predominant source of failures in image classification is distribution shifts between training data and deployment data. To understand the current state of silent failure prevention in medical imaging, we conduct the first comprehensive analysis comparing various CSFs in four biomedical tasks and a diverse range of distribution shifts. Based on the result that none of the benchmarked CSFs can reliably prevent silent failures, we conclude that a deeper understanding of the root causes of failures in the data is required. To facilitate this, we introduce SF-Visuals, an interactive analysis tool that uses latent space clustering to visualize shifts and failures. On the basis of various examples, we demonstrate how this tool can help researchers gain insight into the requirements for safe application of classification systems in the medical domain. The open-source benchmark and tool are at: https://github.com/IML-DKFZ/sf-visuals.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_39

SharedIt: https://rdcu.be/dnwBz

Link to the code repository

https://github.com/IML-DKFZ/sf-visuals

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This work studies silent failures of medical image classifiers and the confidence scoring functions (CSFs) meant to detect such failures. The paper posits that silent failures are largely due to distribution shifts between the training distribution and the deployment data, which in medical imaging are mainly corruption, manifestation, and acquisition shifts. The paper presents a detailed analysis of various confidence scoring functions across multiple medical image analysis datasets from different medical domains. These experiments found that none of these methods is sufficient to deal with silent failures, and to better understand where these silent failures occur, the paper presents a visualization framework to help identify silent failures and their various causes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work builds heavily on a previous study, but its extension to medical imaging datasets gives very useful insights to the MICCAI community, which has not seen much work on this topic before. The paper conducts a range of experiments that give meaningful insight into different confidence scoring functions and how their performance can be impaired by the types of distribution shift often encountered when medical imaging systems are applied in clinical environments. These experiments used a wide range of publicly available medical image datasets across four different medical image domains. The experimental results are well reported, and the additional reporting in the supplementary material gives a greater understanding of the results. The experiments were also averaged over multiple runs, giving an indication of the stability of the methods. The visualization framework presented in Figure 1 is a great tool to help identify these silent failures and their causes. Figure 1 also helps establish the concept of silent failures for the reader by giving a visual example at the beginning of the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The only Bayesian confidence scoring function used is Monte Carlo Dropout. While this technique is still used in some modern papers, it has been criticised for not being a Bayesian method at all and instead merely a computationally cheap way to obtain uncertainty estimates (“Is MC Dropout Bayesian?”, Le Folgoc et al.). Would other Bayesian neural networks work better in place of MC Dropout in this study? The study design has limited novelty, as it is an extension of the work previously presented by Jaeger et al.
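
    For reference, the sketch below shows how MC Dropout is typically used as a confidence scoring function: dropout is kept active at test time and the softmax outputs of several stochastic forward passes are averaged. This is an illustrative example under assumed PyTorch conventions, not the benchmark's actual implementation.

        # Illustrative sketch of MC Dropout as a confidence scoring function (CSF);
        # names and details are assumptions, not the paper's code.
        import torch
        import torch.nn.functional as F

        def enable_dropout(model: torch.nn.Module) -> None:
            # Keep only dropout layers stochastic; leave e.g. batch norm in eval mode.
            for module in model.modules():
                if module.__class__.__name__.startswith("Dropout"):
                    module.train()

        @torch.no_grad()
        def mc_dropout_confidence(model, x, n_samples: int = 20):
            model.eval()
            enable_dropout(model)
            # Average the softmax outputs over several stochastic forward passes.
            probs = torch.stack(
                [F.softmax(model(x), dim=-1) for _ in range(n_samples)]
            ).mean(dim=0)
            prediction = probs.argmax(dim=-1)
            confidence = probs.max(dim=-1).values  # mean-softmax confidence score
            return prediction, confidence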

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All work for this paper uses publicly available data, and all code will be made available online. In the supplementary material, the paper presents detailed training parameters for each of the trained models.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Following the comment in the weaknesses section, would using another Bayesian neural network method work better for this study, given the previous criticism of Monte Carlo Dropout? Further to this, if Monte Carlo Dropout is to be used in this study, could the justification for its use be extended in the paper?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a well-written and clear case that modern classifiers are not robust to distribution shifts, leading to silent failures that confidence scoring functions do not pick up on. The study is similar to the one presented in Jaeger et al. (ICLR 2023), extended to medical image datasets across medical domains. This paper would have scored higher if there were more novelty to the study, but the study is still informative and useful to the MICCAI community. The second contribution is a visualization framework to help identify the previously mentioned silent failures, which can help in the deployment of models in clinical applications.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I think this paper has merit and should be accepted. While I appreciate the concerns reviewers 3 and 4 presented in their reviews, I believe the authors' response addresses their main concerns while also restating the main contributions of the work, and it makes a clear case for the work's acceptance.



Review #3

  • Please describe the contribution of the paper

    The authors discussed the problem of detecting silent failures in medical image classification. Through a comprehensive evaluation, the authors demonstrated that none of the currently available methods is reliable. Additionally, a visualization tool that facilitates silent failure detection is developed in this paper.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors conducted a comprehensive evaluation of the current silent failure detection methods, and demonstrated that no SF detection method is reliable.
    • A visualization tool was proposed to show the classifier failure in different cases.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The visualization tool does not provide much insight into how to improve the currently available classifiers.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No concerns here as the code will be available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Further discussion on how the SF-visualization tool can guide the design of more robust and reliable classifiers would make the paper stronger.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The evaluation results on SF detection are certainly valuable; however, the proposed SF visualization tool provides little unexpected insight.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    A paper aiming to showcase a visual, interactive assessment tool for understanding failures of deep learning networks across multiple confidence scoring functions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I find the paper interesting, but believe that this type of work has been done in similar works for the purpose of calibration, although looking at multiple methods is potentially different. However, I think what is called silent failures here is perhaps the same as what model calibration aims to fix and indicate. It is good work; I am just not sure about the terms utilised, but the authors' work does focus on the visual analysis of failures, and in this regard the paper makes a contribution to the field.

    Furthermore, I am pleased to see an optimisation strategy for the hyperparameters and an analysis of confidence not only with respect to the softmax score; this shows good thought and layout of experiments.

    However, in general I do not find it novel, but rather a joint use of current methods as a sort of saliency mapping?

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think the language/terms used can be improved. When reporting values over three runs, one should also include the standard deviation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code has been provided and in this way makes it easy for other researchers to utilise.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In the Methods section, the word “spirit” does not suit the style of the paper, and a more technical term would complement the nature of the work better.

    For the tabulated results (Table 1), three runs were performed, so a standard deviation should be computed and reported. Also ensure the quotation marks are all correct; they are currently formatted incorrectly.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is borderline for me, as some improvements are needed and I am not quite sure about the novelty of the paper, but it is an important area of research that we need to pursue.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I am happy with the authors' feedback, as the point clarifying the ‘Definition of SF’ relative to general calibration methods is clearer, and I am happy with the subsequent change of Fig. 1a/b from max-softmax to ConfidNet. I am also pleased that they have thoroughly revised their manuscript to avoid future misunderstandings, as different papers often have different perspectives on these important topics.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers in general found this manuscript to be an interesting experimental evaluation of silent failures in medical imaging, and agreed that the evaluation was thorough, covering multiple datasets with good insights.

    The primary critique all reviewers brought up was that the method was not novel but rather appeared to be an application of an existing method to medical data. The authors must address this point: are there any novel methodological contributions, and/or why is it necessary to evaluate on medical images (would the authors expect the method to work differently in a medical context)?

    There was also some concern by R2 about the usefulness of the visualization method, especially regarding how it can be used to improve current models.




Author Feedback

We sincerely thank all reviewers for their time and effort and appreciate the valuable feedback and the generally positive reception (“very useful insights to the MICCAI community”, “thorough evaluation”, “great tool”). While we are concerned about incorrect statements in reviews R4+R3, we are optimistic that resolving these misunderstandings will also resolve the related sentiment of a “lack of novelty”.

Resolving misunderstandings:

  • Calibration (R4): “this type of work has been done under the purpose of calibration methods”. This is incorrect. Calibration and silent failure (SF) prevention are two fundamentally different tasks: a classifier can, e.g., be perfectly calibrated and still yield substantial amounts of SF, and vice versa (see e.g. Jaeger et al., Appendix C, for an in-depth discussion). Thus, the importance of confidence ranking tasks (like SF prevention) has been widely acknowledged in the ML community. Further, as calibration is generally constrained to predicted class scores, the statement reflects a flawed understanding of SF that ignores the general uncertainty estimation (UE) stage, represented by the confidence scoring function (CSF).
  • Definition of SF (R3+R4): An SF is not simply a classifier failure, but a failure of both the classifier and the CSF (CSFs are general functions for UE that can be different from max-softmax (!), see e.g. Table 1). Crucially, the two are individual components of a two-stage system where one can prevent and the other detect failures (a minimal illustrative sketch of this two-stage view is given after this list). To clarify this distinction, we changed the example CSF in Fig. 1a/b from max-softmax to ConfidNet.
  • Visualization tool (R3): Neglecting the SF definition seems to have also affected the review of R3, who incorrectly states that the “tool shows classifier failure” and thus concludes that the tool “does not provide too much insight on how to improve [..] classifiers”. Yet, the goal is not to improve classifiers, but to visually connect, for the first time, classifier, UE-stage, and images (note that the colors in Fig. 1b/right depict the CSF!). This allows identifying crucial silent failure sources in the data and thus reflecting on dataset and method design for a given application (see examples in Sec. 4.2). However, R3's criticism applies in that using the tool for future development of improved CSFs requires comparing failure modes across CSFs. We added a respective demo case to the manuscript.
  • Saliency mapping (R4): Our tool does not perform saliency mapping, a fundamentally different concept, where individual classifier decisions are spatially attributed to image regions.
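
To make the two-stage definition above concrete, the following minimal sketch counts a case as a silent failure only if the classifier is wrong and the CSF simultaneously fails to flag it. This is an illustrative example with assumed variable names and an assumed rejection threshold, not our benchmark code.

    # Illustrative sketch of the two-stage view of silent failures (SF);
    # names and the threshold are assumptions, not the benchmark implementation.
    import numpy as np

    def silent_failure_rate(y_true, y_pred, confidence, threshold):
        """Fraction of cases where the classifier fails (stage 1) AND the CSF
        does not flag the case, i.e. confidence stays above the threshold (stage 2)."""
        y_true, y_pred, confidence = map(np.asarray, (y_true, y_pred, confidence))
        classifier_failure = y_pred != y_true    # stage 1: prediction is wrong
        undetected = confidence >= threshold     # stage 2: CSF accepts the case
        return float(np.mean(classifier_failure & undetected))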

The fact that SF prevention can cause confusion shows the novel perspective and the demand for a dedicated introduction in the MICCAI community. We have thoroughly revised our manuscript to avoid these misunderstandings in the future. Also, we would like to kindly ask R4+R3 to re-consider their assessment of our work based on the resolved inaccuracies.

Novelty:
We do not “apply existing methods to medical data” (Meta-R), but we introduce a new task to the MICCAI community, i.e. a specific use-case of UE that is highly relevant - yet currently not considered - in medical applications. This introduction entails 3 novel contributions: 1) An open benchmark incl. rigorously crafted distribution shifts specific to medical scenarios on 4 data sets. 2) A study of prevalent CSFs in these new settings providing valuable insights as a basis for future method development. 3) SF-visuals, the first tool to visually interconnect classifier, UE-stage (!), and images, fostering a deeper understanding of data- and method-related SF sources in specific applications.

A new benchmark/insights/visual-tool might represent a more abstract novelty than e.g. an architectural modification improving a task by some %. Yet, we strongly believe our community needs to re-think novelty in order to ensure true scientific progress, as excellently argued in: “Novelty in Science”, https://shorturl.at/lszVW




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After rebuttal, all reviewers are in agreement that this manuscript is worthy of publication at MICCAI.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper contributes a thorough demonstration of how existing methods for detecting silent failures in classification can fail on a range of medical imaging datasets. Prior to rebuttal, the paper was criticised for not contributing novel methods, which is true. However, as the authors bring up in their rebuttal, their paper is important both for showcasing to the medical imaging community that the problem of silent failures is real and something to be aware of, and for clearing up confusions, also encountered in the reviews, regarding the interaction between calibration and silent failure. I agree with the authors' point here, and so do the reviewers, who now unanimously recommend acceptance, as do I.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    For this paper, there seems to be a good consensus on acceptance by all reviewers. Of note, I do share the thoughts of R1 on the marginal contribution of simply following [Jaeger et al., ICLR 2023] but switching to medical data, and I also think that we, as a community, should stop using MC Dropout as a straw man in uncertainty/calibration comparisons. In addition, after reading this paper I got the same feeling as when reading Jaeger et al.: the authors should relax their heavy usage of acronyms; it demands too much mental effort and makes for a very unpleasant read, as one often finds oneself needing a dictionary of acronyms at hand to understand parts of the text. Anyway, overall the paper seems useful for the MICCAI audience and I see no reason to recommend rejection.


