
Authors

Alfie Roddan, Chi Xu, Serine Ajlouni, Irini Kakaletri, Patra Charalampaki, Stamatia Giannarou

Abstract

The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For the surgeon to trust the model, explainability of the generated predictions needs to be provided. For image classification models, pixel attribution (PA) and risk estimation are popular methods to infer explainability. However, the former method lacks trustworthiness while the latter cannot provide visual explanation of the model’s attention. In this paper, we propose the first approach which incorporates risk estimation into a PA method for improved and more trustworthy image classification explainability. The proposed method iteratively applies a classification model with a PA method to create a volume of PA maps. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data verifies that our improved explainability method outperforms the state-of-the-art.
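The combination step the abstract describes — a pixel-wise expectation over a volume of T PA maps, plus the coefficient of variation as a risk estimate — can be sketched as follows. This is an illustrative sketch, not the authors' code; `pa_volume` is a random placeholder standing in for T stochastically generated PA maps.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 100, 8, 8                         # forward passes, map height/width
pa_volume = rng.random((T, H, W))           # placeholder for T stochastic PA maps

enhanced_map = pa_volume.mean(axis=0)       # pixel-wise expectation: enhanced PA map
std_map = pa_volume.std(axis=0)             # pixel-wise standard deviation
risk_map = std_map / (enhanced_map + 1e-8)  # coefficient of variation (CV) as risk

print(enhanced_map.shape, risk_map.shape)   # (8, 8) (8, 8)
```

A high `enhanced_map` value with a low `risk_map` value marks a pixel whose attribution is both strong and stable across the T passes.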

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_54

SharedIt: https://rdcu.be/dnwzm

Link to the code repository

https://github.com/alfieroddan/Explainable-Image-Classification

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors developed a method to compute the class activation maps of a neural network using a dropout-based ensemble method. The resulting activation map is computed as the average of the activation maps of the ensemble predictions. The authors also computed the disagreement/scaled variance between the activation maps to obtain the pixel-wise uncertainty, or trustworthiness, of the computed activation map.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of computing class activation maps (CAMs) using ensemble to compute the uncertainty of CAMs is a nice idea.
    • Authors have evaluated several CAM methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors use T=100 forward passes with dropout at test time. Do we need such a large number of forward passes? This decision should be guided by experiment.
    • It would be interesting to see the correlation between the uncertainty of the CAMs and the actual classification performance.
    • The evaluation only shows the effectiveness of the CAMs; however, the effectiveness of the CAM uncertainty is not evaluated. For example, what is the average drop performance of regions that have higher uncertainty?
    • The performance of dropout-ensemble-based Score-CAM is not included in the evaluation. The authors give computation time as the reason for the omission; however, it should be computable in reasonable time with a lower value of T.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The details are present in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • In the abstract, the acronym PA is used before it is spelled out.
    • Please refer to section 6 (main weaknesses)
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposed a nice idea of computing CAM uncertainty. The paper requires additional experimental validation to bolster its strength.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The work proposed a method to estimate the uncertainty in the saliency map by utilizing MC dropout. The explanation is presented by visualizing the mean and normalized standard deviation of the distribution of the saliency map. The evaluation of the surgical brain tumor image datasets showed good explainability performance on multiple aspects of the explainability task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The proposed method is potentially innovative because there are few works (but not none) that address the uncertainty aspects of saliency map explanation.
    • The evaluation is thorough on two different backbone models and five saliency methods, with metrics covering different aspects from explanation faithfulness, coherence, simplicity, and speed.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    • Lacks literature review and comparisons on the closely related works on the uncertainty estimation of explanation. There is a MICCAI 20 paper (see 9.5) and other closely related works in computer vision. It is hard to assess the novelty given this.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No ethics approval statement is provided in the manuscript for the private dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Details of the weaknesses:

    1. There is no full name of “PA” in the abstract.

    2. In the introduction, “it is vital that the surgeon trusts the output predictions of the model otherwise the model is rendered useless”. Because the ML prediction is error-prone, it is unethical to let physicians trust all of the output predictions with explanations as the predictions may be wrong. The physicians’ trust in the model output can be calibrated by the explanation, so that doctors can detect the wrong predictions from AI via its explanations. The goal to use explanation is not to enhance trust in a predicted output, but to calibrate the user’s trust in a predicted output. Please revise accordingly.

    3. “Surgeons in practice can use this risk during diagnosis to trust the model for decision making [REF].” This sentence is missing the reference.

    4. Could the authors provide a clear definition of trustworthiness? It is used as an important judgmental perspective of different methods, but it is unclear what exactly the authors are referring to by saying trustworthiness.

    5. The authors failed to review directly related works on applying uncertainty estimation to generate explanations. The following statement needs to be revised accordingly and clearly state the differences and novelty compared to prior works: “This volume is used for the first time, to generate a pixel-wise distribution of PA values from which we can infer risk.” E.g.: in MICCAI 20 “Efficient Shapley explanation for features importance estimation under uncertainty”. There are other closely related works in natural image tasks.

    6. What does sigma in Eq 5 stand for?

    7. In Eq. 3, the calculation of the average drop evaluation metric: how is the performance confidence drop measured? Was dropout turned off in the average drop evaluation? The average drop assesses the faithfulness of the explanation to the model’s decision process. But since the explanation is based on an ensemble of multiple models with different dropout connections, it is unknown which model the explanation is faithful to.

    8. Please provide a short explanation of why the model’s predictive performance was not reported. Although it is not strictly necessary to report the model performance given that the novelty and contribution are mainly on the explainability task, it would be better to include this information.

    9. It would be better if this work can be evaluated by the surgeons, especially on how they interpret and synergize the information from both the mean and normalized standard deviation of the saliency maps. This would add more knowledge and insights into the clinical relevance of such kind of explanation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    If the authors can justify the novelty of the work by comparing the differences with prior closely related works on uncertainty estimation for explainability, the novelty claim may hold in this case.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I appreciate the authors addressing my main concerns on the comparison with related work and the clearness of presentations.



Review #3

  • Please describe the contribution of the paper

    Explainability and trustworthiness are two key factors for the deployment of machine learning models. This paper proposes the first approach which incorporates risk estimation into an explainability method. This fusion leads to improved explainability of model predictions. Specifically, the paper introduces a method to generate a volume of pixel attribution (PA) maps. Based on the generated PA maps, the coefficient of variation is used to estimate pixel-wise risk. Experiments conducted on pCLE data show that the proposed method improves explainability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method fuses the advantages of pixel attribution and risk estimation methods. This fusion improves the explainability as well as the trustworthiness of the tissue classification model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The discussion about the PA volume is limited. Although the introduction to generating PA volume is clear and easy to follow, the rationale behind this design is not very clear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The overall method description is clear. The method is straightforward and it should be easy to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. More analysis of Table 1 is necessary. The paper uses five metrics to evaluate the performance of the proposed method, and it seems that the proposed method does not always achieve the best results.
    2. It would be better to add some ablation studies on key parameters, such as the dropout rate and the number of iterations T. How would these parameters affect the model performance?
    3. The explanation of Figure 1 and Figure 3 is not very clear.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The experiments are not sufficient. There is no ablation study to demonstrate the properties of the proposed PA volume.
    2. The illustrations in Fig 1 and Fig 3 are hard to follow.
  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The study related to PA volume is still limited.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces an idea of CAM uncertainty. Experiments are conducted on two backbone models with various metrics for evaluating faithfulness, coherence, simplicity, and speed of different methods. Results show the effectiveness of the proposed method over various CAM variations. However, reviewers have concerns with limited experiments (please find details in reviewer comments). Also, some details and design rationale are missing in the current version of the paper. Some related studies on uncertainty estimation of explanation are missing, which makes it difficult to clarify the novelty of the proposed method. It would be great to address the concerns during the rebuttal. In particular, it will be great to clarify the novelty of this work compared with other uncertainty estimation of explanation.




Author Feedback

We thank the reviewers for their constructive comments. Minor issues will be addressed in the revised paper. R 1 & 3

  1. Value of T and Dropout Rate The value of T was set high enough to give a fair distribution of PA values, but without compromising too heavily on computational complexity, i.e. T=100. The dropout rate was set to 0.1, in line with [11]. ADCC cannot be compared between differently trained models, thus a dropout-rate study is not meaningful. R 1
  2. Correlation between uncertainty and classification performance. We qualitatively investigated the effect of a poorly performing model trained without augmentation and visually noticed on the PA maps that the average noise is reduced. PA methods are not affected by the classification performance, but rather by how robustly the models are trained.
  3. Effectiveness of CAM uncertainty The effectiveness of CAM uncertainty is qualitatively shown in Fig. 1 & 2, where we confirm that the PA values not only have high value but also low variation, with the low variation providing a level of precision to the PA value.
  4. ScoreCAM not included. The authors of ScoreCAM have not released a batched implementation of ScoreCAM, so computation times are large even at low values of T: T=10 (ADCC=0.606) still took 3.75 seconds, which remains unfeasible to evaluate. R 2
  5. Ethics All studies on human subjects were performed according to the requirements of the local ethics committee and in agreement with the Declaration of Helsinki (No. CLE-001 Nr: 2014480). This statement will be included in the paper. Goal of explainability We agree with the reviewer, and we will clarify this statement in our paper.
  6. Definition of trustworthiness Trustworthiness means how reliable a model’s explanation is in a clinical scenario. Trustworthiness is improved by showing the variation level of the PA values obtained by perturbing the model with Dropout.
  7. Related work on uncertainty The differences between the two methods are: • DistDeepSHAP is a model-agnostic interpretability method which permutes an input image to find which pixels are most important to the model. It is not a class activation (PA) method. • DistDeepSHAP samples from the training dataset for permutations. This requires the training dataset to be in local memory, which is not feasible in most real-time / deployable applications. • Our method finds the most important pixels using CAM methods designed for CNNs. • ADCC is a metric specific to CAM methods; it is not applicable to DistDeepSHAP. We will revise the paper to include the related work and state the above differences.
  8. Sigma in Eq 5 Sigma is the standard deviation; this will be included.
  9. Performance confidence drop In Eq. 3, no performance confidence drop metric is involved.
  10. Was the dropout turned off in the average drop evaluation? Dropout is on at test time for our method.
  11. Faithfulness of the explanation to one model By averaging over PA maps from multiple models we are not faithful to one model but to multiple explanations, while suppressing noise.
  12. Model’s predictive performance The classification Top 1 accuracy is: • Resnet18 : 94.0% • MobileNet: 86.6% These results will be included in the revised paper.
  13. Surgeon feedback Surgeons in our team suggested that PA methods can be used in a clinical setting to prove that the model has been trained correctly, and that the coefficient of variation can indicate that the explanation map is precise. R 3
  14. PA volume design With a batched implementation of MC Dropout we can predict multiple PA maps with one forward pass, with different model architectures (see [11] for more). Using the pixel-wise average, we improve the ADCC of all PA methods. Performance metrics in Table 1 As explained in “Evaluation Metrics”, only ADCC combined with computation time gives us a reliable overall metric of a PA method. Our method improves the ADCC of all PA methods.
  15. Explanation on figures We will improve the explanation in the revised paper.
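The test-time dropout scheme the rebuttal discusses (items 1, 10 and 14: dropout kept on at inference, with T stochastic passes forming the PA volume) can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: `forward_with_dropout` and the all-ones input are hypothetical stand-ins for the real model and PA method.

```python
import numpy as np

def forward_with_dropout(x, rng, p=0.1):
    """Toy forward pass with dropout rate p left active at test time."""
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return (x * mask) / (1.0 - p)     # inverted-dropout scaling keeps the expectation

rng = np.random.default_rng(42)
x = np.ones((8, 8))                   # stand-in for a feature/PA map
T = 100                               # number of stochastic forward passes
pa_volume = np.stack([forward_with_dropout(x, rng) for _ in range(T)])

print(pa_volume.shape)                # (100, 8, 8)
```

Because every pass applies an independent dropout mask, the spread of `pa_volume` across its first axis is what the coefficient of variation then summarises per pixel.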




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper introduces an interesting idea of CAM uncertainty. There is a spread of scores (weak reject to weak accept). The authors have addressed the issues during the rebuttal. I think the strengths outweigh the weaknesses. I would suggest ‘Accept’. It would be great to update the manuscript by considering the comments; in particular, it would be great to add more discussion on the PA volume.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper studies the uncertainty of the CAM using an ensemble approach. The results are (for the most part) qualitative. A better design of the experiments is needed, plus an ablation study and a real evaluation of the uncertainty of the CAM, as pointed out by two reviewers.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal partially addressed some of the concerns raised by the reviewers. However, the main limitations remain. There have been a few papers on the trustworthiness of saliency maps in the past year, including Arun et al., Radiology AI 2021, and Zhang et al., MICCAI 2022, but those works are not included to place the presented contribution in the right context. In addition, the authors should pay more attention to the details. For example, despite R2 pointing out that PA is not fully spelled out in the abstract, the authors did not provide the full name in the rebuttal.


