
Authors

Valentyn Boreiko, Indu Ilanchezian, Murat Seçkin Ayhan, Sarah Müller, Lisa M. Koch, Hanna Faber, Philipp Berens, Matthias Hein

Abstract

In medical image classification tasks like the detection of diabetic retinopathy from retinal fundus images, it is highly desirable to get visual explanations for the decisions of black-box deep neural networks (DNNs). However, gradient-based saliency methods often fail to highlight the diseased image regions reliably. On the other hand, adversarially robust models have more interpretable gradients than plain models but typically suffer from a significant drop in accuracy, which is unacceptable for clinical practice. Here, we show that one can get the best of both worlds by ensembling a plain and an adversarially robust model: maintaining high accuracy while having improved visual explanations. Also, our ensemble produces meaningful visual counterfactuals, which are complementary to existing saliency-based techniques.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_52

SharedIt: https://rdcu.be/cVRsk

Link to the code repository

https://github.com/valentyn1boreiko/Fundus_VCEs

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This submission aims to maintain high accuracy and improved visual explanations simultaneously for the detection of diabetic retinopathy by combining a plain and an adversarially robust model. Experimental results show that the combined model achieves better prediction accuracy than the robust model while providing improved visual explanations compared with the plain model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors address the dilemma that a plain model usually has better prediction accuracy but less reliable visual explanations, whereas an adversarially robust model has more reliable visual explanations but suffers from reduced accuracy. They propose a simple ensemble model (Equation 3) to maintain high accuracy and improved visual explanations simultaneously.

    2. The ensemble model can also generate a saliency map by the difference between visual counterfactual explanations and the original image, which can serve as a complementary method to existing saliency-based techniques.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    From the perspective of controlling the trade-off between adversarial and plain training schemes, Equation 2 (adversarial training) and Equation 3 (ensembling) are similar. The authors should therefore provide a more detailed argument and discussion of why Equation 3 is effective.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors do not provide source code, but with the detailed description presented in the paper, the results should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The justification of why Equation 3 (ensembling) is more effective than Equation 2 (adversarial training) should be given a more detailed discussion.

    2. It would be better to design experiments justifying why tuning the parameter beta (in Equation 2, adversarial training) cannot help a robust model achieve high accuracy and good visual explanations simultaneously.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty and experimental results.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper
    • The authors propose an ensemble model (consisting of a baseline model and an adversarially robust model) to mitigate the trade-off between classification performance and quality of visual explanations.
    • Experiments were performed on three Diabetic Retinopathy (DR) datasets, one of which had pixel-level DR lesion annotations.
    • They compare the interpretability results obtained with visual counterfactual explanations against two common saliency map techniques (Guided Backprop and Integrated Gradients).
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors propose an ensemble model (including an adversarially robust model) to mitigate the trade-off between classification performance and quality of visual explanations. This makes sense, since good performance sometimes results from spurious correlations in the data rather than from clinically supported evidence; improved saliency maps are precisely an indication that the models have become more robust.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The work lacks novelty. The only novelty element of this work is the use of an ensemble model.
    • The two saliency map techniques used in the experimental setting are two of the simplest ones (Guided Backprop and Integrated Gradients).
    • Introducing an ensemble increases complexity and makes the model less interpretable.
    • It is not clear how the saliency maps were computed (as presented, the network has two branches, each taking the image as input; how were the gradients propagated?).
    • It is also not clear how the oversampling of the minority class was performed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Reproducibility is guaranteed.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The authors should provide more details regarding the generation of the saliency maps and oversampling.
    • It would be interesting to check the visual explanations generated by more recent saliency map techniques (e.g., LRP).
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The work is interesting but lacks novelty.
  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    This paper suggests two contributions: the proposal to ensemble a traditionally-learnt and an adversarially-learnt model in order to improve interpretability, and an adversarial approach to generating saliency maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strength of the paper is its clear prediction regarding what the effect of ensembling traditional and adversarially-learnt models would be on accuracy and interpretability. This hypothesis is largely supported by the data, although with a lower robust accuracy than expected (i.e. the ensemble models seem to err on the side of the traditionally-learnt models).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper (correctly) notes that saliency maps are used to help justify a decision made by a neural network model to a clinician. However, this paper only shows something similar to a saliency map for two cases (Fig 1 and Fig 2) in which the diagnosis is very visually apparent and the classification is correct, which is not clinically relevant. What would be clinically relevant are cases that are either non-apparent (with a correct prediction) or ones in which the prediction is incorrect. Without those, it is hard to tell if their technique has value to a diagnostic workflow.

    From Figure 1, it is hard to see what the advantage of their VCE approach is compared to the other approaches. Qualitatively, it seems that Guided Back-Prop is more specific than the proposed method. From Table 2, it seems that both comparative methods on average outperform the proposed one, although that might not be statistically significant. (The authors can easily check this with paired non-parametric tests such as signed-rank tests.)

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is mostly clear about the techniques and data it used, as well as the open-source implementations and public databases it relies on, meaning that the work could conceptually be reproduced relatively easily.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Major comments:

    • The paper spends a lot of time motivating VCEs and their derived saliency maps, yet provides no evidence (and some weak counter-evidence) that they are better than simple, openly available frameworks. Ultimately, the paper would be improved by focusing solely on the first contribution, which takes up less than half of the methods (about 1 page of methods+results versus approximately 2 pages for the same sections for the VCE components) but produces meaningful results. What would be warranted here are more ablation studies, studies using different ensembling techniques, or ‘pure’ ensembles of traditional or adversarially-learnt approaches as well as mixed ones.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper spends most of its effort on proposing a saliency map approach that doesn’t outperform already available ones. Although this is not a problem in itself, the paper doesn’t provide the extensive analysis of this approach that would make it of scientific (rather than technical) interest. The area of more technical interest (i.e. the ensembling approach) also doesn’t receive enough attention (e.g. ablation studies, clinically meaningful visualisation, etc.) to be of sufficient interest to a MICCAI reader in its own right.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed an ensemble-based method to detect diabetic retinopathy with both high accuracy and good explanations. The paper presented an interesting approach named visual counterfactual explanations. However, as raised by R3 and R4, the novelty is limited. I also question the approach of combining visual counterfactuals with ensemble learning; the details about Equations 2 and 3 are missing. In addition, R3 and R4 point out that the paper focuses on saliency maps and their motivation, while the visual counterfactual explanations receive less attention. Therefore, the authors should address these concerns in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

We thank the reviewers for their insightful comments.

Novelty & contributions (M,R3,R4): We may not have optimally communicated the novelty of the visual counterfactual explanation (VCE) approach: it can use a given classifier as a generative model to manipulate the image and show how it would have had to look for the classifier to take a different decision. One can even create an animation as the image is made increasingly “healthy” or “sick” (see Suppl. Material). We applied this approach for the first time to medical images and believe it goes conceptually far beyond the traditional interpretability framework.

In addition, our paper makes further novel contributions: (a) We show that ensembling a plain and a robust DNN can overcome the accuracy drop of the robust model while preserving VCE quality. (b) We use VCEs to compute saliency maps in a new way with a controlled level of sparsity by exploiting the AFW optimization procedure to allow for p>1. (c) We show that even traditional saliency map techniques are more informative for the ensemble compared to a plain DNN.
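For illustration, a minimal sketch of the core idea behind contribution (b) in its simplest form: a saliency map obtained as the channel-aggregated difference between a VCE and the original image. The sparsity control via the AFW optimization for p > 1 is not reproduced here, and the function name and array conventions are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def vce_saliency(image, vce):
    """image, vce: float arrays of shape (H, W, C) scaled to [0, 1]."""
    diff = np.abs(vce - image).sum(axis=-1)  # aggregate the change over color channels
    if diff.max() > 0:
        diff = diff / diff.max()             # normalize to [0, 1] for visualization
    return diff
```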

Advantages of VCEs over traditional saliency maps (M,R3&R4): The reviewers perceived VCEs simply as another tool to generate a saliency map. However, VCEs go beyond those techniques, as they can be used to generate images and even animations (see Suppl. Material) that illustrate how an image would have to change to affect the prediction of the classifier. For our task of DR classification, our method, e.g., removes lesions when changing an image towards the healthy class and adds realistic lesions when changing towards the diseased class, providing a much more intuitive explanation of the behavior of the classifier.

Interpretability of the ensemble (M,R3): The reviewer expressed concern about the interpretability of the ensemble. GBP and IG were based on the gradients of the ensemble model logit with respect to the inputs; these gradients were propagated through both models, as the predicted probability of the ensemble is the average of the two models’ predicted probabilities. All saliency maps of the ensemble were more interpretable than those of the plain model, so we do not find the reviewer’s concern warranted. In addition, the ensemble allowed us to generate VCEs, in contrast to the plain model.
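For illustration, a minimal sketch of this setup, assuming probability averaging as described above: an ensemble module whose output is the mean of the two models' softmax probabilities, and an input-gradient saliency map computed through both branches. GBP and IG from the paper are replaced by a plain input gradient for brevity; names and shapes are illustrative.

```python
import torch
import torch.nn as nn

class PlainRobustEnsemble(nn.Module):
    """Averages the predicted probabilities of a plain and a robust model."""
    def __init__(self, plain_model, robust_model):
        super().__init__()
        self.plain = plain_model
        self.robust = robust_model

    def forward(self, x):
        return 0.5 * (torch.softmax(self.plain(x), dim=1)
                      + torch.softmax(self.robust(x), dim=1))

def input_gradient_saliency(ensemble, x, target_class):
    """Gradient of the ensemble probability w.r.t. the input image batch x (B, C, H, W)."""
    x = x.clone().requires_grad_(True)
    score = ensemble(x)[:, target_class].sum()
    score.backward()                      # gradients flow through both branches
    return x.grad.abs().amax(dim=1)       # per-pixel saliency, max over color channels
```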

Clinically relevant applications of VCEs (R4): The reviewer suggested applying VCEs to clinically relevant cases, where the lesions are either non-obvious or the prediction is wrong. We agree that this is interesting from a clinical perspective, but introducing the new method first required exploring its properties in clear cases. Nevertheless, we had included one such case (Fig. 7, Suppl. Material), showing a VCE for an image for which the DNN predicted healthy while the original label was DR. The sequence of generated images clearly highlights initially non-obvious lesions in the image.

The effect of Eqs. 2 and 3 and the coefficient beta (R1,M): The reviewer noted that both adversarial training (Eq. 2) and ensembling (Eq. 3) control the trade-off between clean and robust performance, and suggested evaluating the effect of beta. While this is true, we could only obtain a large improvement in clean accuracy while maintaining the interpretable gradients of the robust model by introducing the additional ensembling procedure. In our experience, tuning beta down cannot achieve the same effect as ensembling, as it negatively impacts interpretability.
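For illustration, a hedged sketch contrasting the two mechanisms: a beta-weighted robustness term inside the training objective of a single model (written here in a TRADES-like form with a single-step perturbation; the exact form of Eq. 2 may differ), versus combining a separately trained plain and robust model at prediction time (Eq. 3, again as simple probability averaging).

```python
import torch
import torch.nn.functional as F

def beta_weighted_adv_loss(model, x, y, beta, eps=8/255, step_size=2/255):
    """Clean loss plus a beta-weighted robustness term (TRADES-like sketch)."""
    logits_clean = model(x)
    clean_loss = F.cross_entropy(logits_clean, y)
    # one-step perturbation for brevity; real adversarial training uses multi-step PGD
    delta = torch.zeros_like(x, requires_grad=True)
    kl = F.kl_div(F.log_softmax(model(x + delta), dim=1),
                  F.softmax(logits_clean.detach(), dim=1), reduction="batchmean")
    grad = torch.autograd.grad(kl, delta)[0]
    delta = (step_size * grad.sign()).clamp(-eps, eps)
    robust_loss = F.kl_div(F.log_softmax(model(x + delta), dim=1),
                           F.softmax(logits_clean.detach(), dim=1),
                           reduction="batchmean")
    return clean_loss + beta * robust_loss  # beta trades off clean vs. robust behaviour

def ensemble_predict(plain_model, robust_model, x):
    """Eq. 3 style alternative: combine two separately trained models at test time."""
    with torch.no_grad():
        return 0.5 * (F.softmax(plain_model(x), dim=1) + F.softmax(robust_model(x), dim=1))
```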

Selection of saliency map techniques (R3): The reviewer wondered why we did not choose more modern saliency map techniques than GBP and IG. Recent work (Refs. 6 and 33) has shown that GBP and IG are among the best techniques for generating saliency maps for DR detection, outperforming various LRP variants, and they were therefore selected.

Implementation of oversampling (R3): Samples from the minority class (DR) were oversampled such that each batch contains an equal number of samples from both the healthy and DR classes.
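For illustration, a minimal sketch of one common way to implement such oversampling in PyTorch via a WeightedRandomSampler, which balances healthy and DR samples per batch in expectation; the exact scheme used (which enforces equal counts per batch) may differ, and the dataset and label variables are placeholders.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=32):
    """labels: iterable of 0 (healthy) / 1 (DR) class indices, one per sample."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels)                 # e.g. [n_healthy, n_dr]
    sample_weights = 1.0 / class_counts[labels].float()   # minority samples get larger weight
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                    replacement=True)     # with replacement, so the minority class is re-drawn
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```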




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal provided satisfactory feedback on the main concerns (novelty of VCEs). I now understand the approach of combining visual counterfactuals with ensemble learning. Although it is just a medical application of VCEs, this paper is worthy of publication at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces the interesting idea of visual counterfactual explanations to improve interpretability. Some reviewer concerns (e.g., novelty) were addressed during the rebuttal. I think this paper is worth presenting at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a potentially highly useful tool for visualizing the transition between disease states or health conditions, through a method called “visual counterfactual explanation” (VCE). The reviews were split, and two of the reviewers (R3 and R4) questioned its novelty and the actual advantage of VCEs compared to saliency maps or existing methods. The authors provided satisfactory feedback on the major concerns, and I do see the innovation and clinical value of using VCEs with a simple ensemble model to interpret disease diagnoses and their visual differences to healthy or diseased cases. Thus I recommend acceptance of this paper, while asking the authors to further improve the clarity of the equations.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


