Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Alexander C. Jenke, Sebastian Bodenstedt, Martin Wagner, Johanna M. Brandenburg, Antonia Stern, Lars Mündermann, Marius Distler, Jürgen Weitz, Beat P. Müller-Stich, Stefanie Speidel

Abstract

In computer-assisted surgery, artificial intelligence (AI) methods need to be interpretable, as a clinician has to understand a model’s decision. To improve the visual interpretability of convolutional neural networks, we propose to indirectly guide the feature development process of the model with augmented training data in which unimportant regions of the image have been blurred. On a public dataset, we show that our proposed training workflow results in better visual interpretability of the model and improves overall model performance. To numerically evaluate heat maps produced by explainable AI methods, we propose a new metric that evaluates focus with regard to a mask of the region of interest. Further, we show that by focusing the features onto the important areas of the scene, the resulting model is more robust against changes in the background, thereby improving model generalization.
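The augmentation described in the abstract blurs the unimportant regions of a training image while leaving the region of interest untouched. A minimal sketch of such a background-blurring step is given below; the function name, the box-blur kernel, and the NumPy-based implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def blur_background(image, mask, kernel=7):
    """Blur everything outside the region-of-interest mask.

    image:  (H, W) or (H, W, C) float array
    mask:   (H, W) boolean array, True on the region of interest
    kernel: side length of the box-blur window (odd)
    """
    pad = kernel // 2
    # replicate edge pixels so the blur is defined at the image border
    padded = np.pad(image,
                    [(pad, pad), (pad, pad)] + [(0, 0)] * (image.ndim - 2),
                    mode="edge")
    blurred = np.zeros_like(image, dtype=float)
    # naive box blur: average over a kernel x kernel neighbourhood
    for dy in range(kernel):
        for dx in range(kernel):
            blurred += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    blurred /= kernel * kernel
    # keep original pixels on the region of interest, blurred pixels elsewhere
    m = mask if image.ndim == 2 else mask[..., None]
    return np.where(m, image, blurred)
```

In the paper's setting the mask would come from instrument segmentation annotations (or an automatic segmentation model), and the blurred frames would be fed to the classifier during the feature-guiding stage.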



Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_12

SharedIt: https://rdcu.be/cVRsY

Link to the code repository

https://gitlab.com/nct_tso_public/gft

Link to the dataset(s)

https://www.synapse.org/#!Synapse:syn25101790

https://www.synapse.org/#!Synapse:syn18824884


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a mask-based training scenario that separates the training target from the background to improve the design of explainable AI models. A multi-class classification problem serves as the target task for explainable AI, and an instance segmentation model is utilized for mask generation. SmoothGrad [10] is applied to visualize the feature output of the model. The authors propose the eCDF-Area method to evaluate the explanatory power of a model and, at the same time, show that it is possible to train with mask-based synthesized images to minimize the labeling burden in guided feature training.
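The eCDF-Area metric mentioned above is not spelled out in this review thread. One plausible rank-based reading — scoring how much of the heat map's high-saliency mass falls inside the region-of-interest mask — could be sketched as follows; the function name and formulation are assumptions for illustration only, not the paper's definition.

```python
import numpy as np

def ecdf_area(heatmap, roi_mask):
    """Hypothetical focus score: rank all pixels by saliency (descending),
    track the fraction of RoI pixels recovered after each rank, and return
    the area under that recovery curve. A heat map concentrated on the RoI
    scores close to 1; one highlighting only the background scores near 0."""
    order = np.argsort(heatmap.ravel())[::-1]        # most salient pixels first
    inside = roi_mask.ravel()[order].astype(float)
    recovered = np.cumsum(inside) / inside.sum()     # fraction of RoI covered
    return recovered.mean()                          # area under the curve
```

Under this reading, comparing the score distributions of two models over a test set would quantify which one focuses its saliency more tightly on the instruments.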

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors claim that the explanatory power of the model is improved when the proposed guided feature training is performed. The authors also propose the eCDF-Area method to quantify the explanatory power of explainable AI. For the quantification of heat-map-based explanation methods such as SmoothGrad, the authors utilize the eCDF-Area method together with the RoI of the target to be recognized. At the same time, guided feature training increases the generalization performance of the model, which can be seen in the SmoothGrad visualizations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall, it was difficult to discern the purpose of the proposed study. The proposed guided feature training looks like a variant of a general feature fusion-based training method. Fused training with additional labeling information can improve recognition performance and enhance feature visualizations such as SmoothGrad. In conclusion, I am not sure that guided feature training can be viewed as a branch of XAI research.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors will publish all relevant code, datasets, and models.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    If the proposed study focuses on the XAI point of view, in my opinion it may be better to focus on the proposal of the eCDF-Area metric for visualization methods that try to explain the output of models, such as Grad-CAM, SmoothGrad, SmoothGrad-CAM++, etc. The training scenario in Figure 1 should be expressed more concretely in terms of the input and output of the model.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It may be more accurate to view guided feature training as an improvement of the training process through additional supervision rather than as XAI research.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Nevertheless, the utility of XAI for instrument localization in visual perception problems is still questionable. The idea of using blurring is explicit, but it seems to be a continuation of prior work such as CutOut data augmentation or copy-and-paste augmentation for instance segmentation. It seems that XAI needs to be extended to fields such as surgical phase recognition that require more explanation. The authors’ responses are reasonable, but I have decided to keep my original score.



Review #2

  • Please describe the contribution of the paper

    This work proposes a data augmentation technique, named Guided Feature Training (GFT), for improving the interpretability of a deep learning model for the task of surgical instrument detection. The technique blurs the background (everything that is not a surgical instrument) to guide the feature training towards the instruments and disregard information that is less useful for the task (e.g. the background). The proposed augmentation requires a binary segmentation model (or binary segmentation annotations) that segments surgical instruments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of blurring the background to guide feature learning is simple and interesting for improving interpretability. The authors propose a metric (eCDF-Area) to quantify the interpretability or focus of the model. A simple figure could help the reader better understand the eCDF metric.
    • The proposed method seems to improve the ‘focus’ of the model onto the surgical instruments in the samples shown by the authors and in the performed experiments
    • The paper is well presented, well written, and easy to understand
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors evaluate the proposed approach in three settings. The third one, as stated by the authors, is “investigating the effects of GFT in its realistic application”. Table 1 shows the results of these three settings. In the realistic-application setting, in which segmentation is automatically inferred, a conventional training (‘none’) obtains an average F1 of 76.7%, using blurred backgrounds (‘blurred’) obtains 74%, and using non-blurred and blurred images (‘combo’) obtains 77.3%. This suggests that using only blurred images damages the performance (though improving the interpretability). The experiment with ‘combo’ seems to slightly improve the performance (+0.6%). However, there seems to be no experiment that shows whether this small improvement comes from using non-blurred + blurred images or simply from training on twice the amount of (repeated) data. The authors should evaluate the model in a fourth setting, ‘naive combo’, where they use twice the amount of (repeated) training data without any blurring, to obtain a fair evaluation.
    • It is difficult to assess a paper for which the data is not public at the time of submission.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Authors will provide the code and dataset upon acceptance. However, these were not available at revision time, so they could not be assessed.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Perform further experiments as (or similar to those) proposed in the point above
    • Further justify (and/or add a reference to) certain statements. For instance: “As preliminary experiments have shown, common models that determine if a certain instrument type is currently visible (instrument presence detection) lack focus on the instruments themselves and often decide based on features in the background.” Where has this been shown?
    • Improve figure captions. For instance, add a description of what the graph shows in Figure 3, especially on the horizontal axis, and extend the captions of figures to state what the reader should take away from them.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The idea of blurring the background to improve interpretability seems to be novel
    • The idea of quantifying the increase in ‘interpretability’ through focus is interesting
    • The performance (F1-score) seems to be slightly hurt or at most comparable to traditional training (without the idea proposed by the authors)
    • Additional experiments supporting that the proposed idea improves the performance (on top of increasing the interpretability) would strengthen the paper
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    Improving the interpretability/generalisability of neural networks using data augmentation and guided feature training for a surgical application, utilising numerical evaluation of the model with a proposed area metric called eCDF.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The eCDF-Area metric is an interesting measure introduced for the numerical evaluation of heat maps against the mask of the region of interest.
    2. Using the features to guide improvements is a nice approach and made for an interesting read.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Interesting results; however, given that the method was evaluated on a small dataset, we perhaps do not have a clear indication of performance but rather the start of a particular trend. My concern is probably the different combinations of original/modified datasets - what/how is an optimal value chosen? This is not clearly defined/explained (page 3).

    1. On page 3, perhaps use a more technical word/wording for ‘decent approach’, as it is a formal paper.
    2. I would like to see more statistical testing performed to compare model performances.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Details in terms of calculations are clear, and code will be provided. The data is publicly available, but the details regarding the images might not be shared as clearly, so one could face some problems replicating the work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Thank you for the work performed for this paper; I have some points to kindly raise: 1.1 Table 1: 0% for the clipper, for automated masks? Why is that, do you know? 1.2 Table 1: I notice the scissors performed less optimally compared to other instruments; any thoughts around this? 1.3 Thoughts about the combo providing the best outcome? Do you think this is because a model needs to see more images not similar to the ideal in order to train more generalisably? 1.4 Why only train with false negatives for guided feature training? 1.5 Figure 1 - I assume you first train the model then retrain? Sorry, this needs to be a little clearer unless I have misinterpreted it. 1.6 Protocol for the video/cine stack - not sure this is clear. 1.7 My concern is probably regarding the different combinations of original/modified datasets - what/how is an optimal value chosen - see previous comment - I don’t see the ratio/size given for each?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think, overall, a well laid out paper; some methodologies should be given in more detail, but a nice approach that is interesting to read about.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths:

    • This paper introduces a novel way to improve the interpretability of the model by using guided feature training.
    • The proposed idea is simple and interesting.
    • A new metric of eCDF-Area is introduced to quantify the explanatory power of the model.

    Weaknesses:

    • The ablation study is missing, which makes it difficult to understand the source of the performance improvement (Q5 of Reviewer 2)
    • The experiments are conducted only on a small private dataset.
    • The captions of the figure need to be improved.
    • Some details of the experiments are missing.
    • The positioning of the paper might mislead readers. In particular, more discussion would be needed to position this paper within XAI.

    Overall: Three reviewers agree that the proposed paper introduces an interesting approach and a metric to quantify the explanatory power of the methods. However, there are some concerns about the current version of the paper, and it would be great to address them during the rebuttal. In particular, please address the following points.

    1. More discussion of the method in terms of XAI is needed, because the current version shows the improvement of the method with additional supervision, which looks like a variant of feature-based training rather than XAI (Reviewer #1).
    2. The ablation study is not well designed, which makes it difficult to fully understand the reason for the performance improvement (Reviewer #2). A discussion of the effect of guided feature training is needed.
    3. It would be great to further provide more justification of the design choices and detailed information on the method and evaluation (Reviewer #1, #2, #3).
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We thank all reviewers for their constructive feedback. In the following, we address the main points of criticism and report the results of the proposed ‘naive combo’ guidance. The final version of the paper will be modified accordingly.

Subject Area: R#1 states that GFT might be viewed as a technique improving the training rather than as XAI research. However, while GFT is not an XAI method itself, its primary aim is to improve the output of visual XAI methods. GFT aims to guide the features explicitly to areas that a human expects to be highlighted when analyzing a classification prediction with visual XAI. This effect is directly evaluated with the introduced eCDF-Area metric. We therefore argue that GFT can be seen as research in the XAI domain, because it aids in making predictions more explainable.

Methodical delimitation: R#1 notes that the method looks like a variant of a feature fusion-based (FFB) method. To our understanding, FFB methods fuse features from multiple domains to improve the representation. We, on the other hand, only use features of the image domain and do not fuse them in any way. Further, FFB usually requires a special model architecture. GFT only modifies the training process without interfering with the model architecture itself and can be applied to any classification architecture as long as it can be split into feature and classifier parts.

Figure clarity: R#1 & R#3 note that the training process shown in Figure 1 is unclear: As R#3 correctly assumes, the model is first trained in the “guide features” step. Subsequently, the model’s classifier is finetuned by retraining the model on the unmodified dataset while freezing the model’s feature layers. The model input consists of modified and/or unmodified images, and the output is a binary multilabel classification of tool presence. This figure and other figure captions will be improved in the final version.
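The two-stage workflow described here — train the full model to guide the features, then freeze the feature layers and finetune only the classifier on unmodified data — can be illustrated with a toy two-layer model. The numpy model, `train` function, and hyperparameters below are illustrative assumptions; the paper's actual models are convolutional networks.

```python
import numpy as np

def forward(x, w_feat, w_clf):
    h = np.tanh(x @ w_feat)                     # feature layer (freezable)
    p = 1.0 / (1.0 + np.exp(-(h @ w_clf)))      # sigmoid classifier (multilabel presence)
    return h, p

def train(x, y, w_feat, w_clf, steps=300, lr=0.5, freeze_features=False):
    """Plain gradient descent on binary cross-entropy; with
    freeze_features=True only the classifier weights are updated."""
    w_feat, w_clf = w_feat.copy(), w_clf.copy()
    for _ in range(steps):
        h, p = forward(x, w_feat, w_clf)
        g = (p - y) / len(y)                    # dBCE/dlogits
        g_clf = h.T @ g
        if not freeze_features:
            # backprop through tanh into the feature weights
            w_feat -= lr * x.T @ ((g @ w_clf.T) * (1.0 - h ** 2))
        w_clf -= lr * g_clf
    return w_feat, w_clf

# Stage 1 (guide features): train the full model on blurred-background frames.
# Stage 2 (finetune): freeze the features, retrain the classifier on originals:
#   w_feat, w_clf = train(x_blurred, y, w_feat0, w_clf0)
#   w_feat, w_clf = train(x_original, y, w_feat, w_clf, freeze_features=True)
```

The point of the sketch is only the control flow: the second call leaves the feature weights untouched, so whatever focus the features acquired on the modified data is preserved.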

Dataset decisions: R#3 expresses concerns regarding the selection of the ratio of original/modified data. For ‘none’ and ‘blurred’ we use one dataset or the other; only in ‘combo’ are the original and modified frames combined in a 1:1 ratio by concatenating both datasets, as described on page 3. Varying ratios of original and modified frames might be investigated in future work. We opted for this simple way of concatenating the datasets to make optimal use of the available data, as we dealt with small datasets, especially in the first and second trials.

Additional experiment: R#2 proposes an additional experiment using a ‘naive combo’ dataset, which consists of two copies of the original dataset. Thereby the dataset is similar in size to the combo dataset, and the effects of the differing dataset sizes between none/blurred and combo can be investigated. The results (F1 for Grasper/ Clipper/ Coagulation/ Scissors/ Suction / mean F1) are 0.55/ 0.14/ 0.77/ 0.22/ 0.43 / 0.42 for the first trial, 0.72/ 0.02/ 0.77/ 0.21/ 0.59 / 0.46 for the second trial, and 0.88/ 0.83/ 0.91/ 0.54/ 0.77 / 0.78 for the third trial, all evaluated on the test data sampled at 1 fps.

The eCDF-Area Metric evaluated on the training data for the first two trials (as shown in Fig. 3a) resulted in a distribution in between the ‘none’ and ‘blurred’ guidance with a mean eCDF-Area of 0.16 each. The evaluation of the third trial on the fake images (Fig. 3b) resulted in a distribution similar to ‘none’ with a median eCDF-Area of 0.26.

In all three trials the ‘combo’ guidance mode improved the model’s focus. In the third trial ‘naive combo’ resulted in the highest mean F1-score, although ‘none’, ‘combo’ and ‘naive combo’ lie very close together. In the first and second trials the model’s focus and performance benefited from the bigger dataset size but did not reach ‘combo’. The additional experiment has shown that GFT improves the focus and that especially small datasets can benefit from it.

The additional results and extended figures will be provided in the supplementary.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers agree that the proposed paper introduces an interesting approach and a metric to quantify the explanatory power of the methods. Some concerns were pointed out during the first stage, but I think the authors successfully addressed them during the rebuttal. This paper would be able to provide interesting insights. I would recommend acceptance of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the reviewers have certain reservations about the paper, the rebuttal helps alleviate the concerns about the paper. The proposed guided feature training can be an interesting presentation on XAI in medical image analysis.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposed a data augmentation technique for improving the interpretability of a deep learning model in surgical instrument detection. The main concerns raised by reviewers are about the experiments, especially the missing ablation study. After the rebuttal, those concerns remain. I think more comprehensive experiments and discussion are essential for a MICCAI paper. Therefore, I recommend rejection of this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


