
Authors

Luisa Gallée, Meinrad Beer, Michael Götz

Abstract

Interpretability is often an essential requirement in medical imaging. Advanced deep learning methods are required to address this need for explainability and high performance. In this work, we investigate whether additional information available during the training process can be used to create an understandable and powerful model. We propose an innovative solution called Proto-Caps that leverages the benefits of capsule networks, prototype learning and the use of privileged information. Evaluating the proposed solution on the LIDC-IDRI dataset shows that it combines increased interpretability with above state-of-the-art prediction performance. Compared to the explainable baseline model, our method achieves more than 6 % higher accuracy in predicting both malignancy (93.0 %) and mean characteristic features of lung nodules. Simultaneously, the model provides case-based reasoning with prototype representations that allow visual validation of radiologist-defined attributes.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_41

SharedIt: https://rdcu.be/dnwyU

Link to the code repository

https://github.com/XRad-Ulm/Proto-Caps

Link to the dataset(s)

https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors effectively combine capsule networks with privileged information and prototypical learning to achieve SotA results for explainable lung cancer diagnosis, even outperforming non-explainable methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors effectively combine three major areas (prototypical learning, capsule networks, and privileged information) to achieve SotA results for lung cancer diagnosis. Further, these results are explainable at the human-interpretable level, something very difficult to achieve with a deep-learning-based approach.
    2. The authors significantly improve over the previous explainable capsules approach (X-Caps) with their Proto-Caps.
    3. The motivation for the work is solid, as explainable approaches for medical imaging are sorely needed.
    4. The experiments and ablations are thorough, and error bars are reported. The finding that turning off the attribute predictions slightly improves performance (but loses explanations) is consistent with previous literature.
    5. The calculation of a Dice score on the reconstruction is a nice touch.
    6. The authors show that a clinician could quickly look at the predicted attributes, see that they’re wrong, and know not to trust the diagnosis (that errors in attribute predictions are good indicators of errors in malignancy prediction).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It would be worthwhile to examine what exactly the correlation factor is between errors in attribute prediction and errors in malignancy prediction. The example showing attributes wrongly predicted and the diagnosis being wrong is great, but a quantitative measure here would be valuable.
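    A minimal sketch of one such measure (illustrative only, not from the paper or the review; the arrays are hypothetical per-nodule error indicators) is the phi/Matthews correlation between the two binary error vectors:

        import numpy as np
        from sklearn.metrics import matthews_corrcoef

        # Hypothetical per-nodule indicators: 1 = prediction wrong, 0 = correct.
        attr_error = np.array([0, 1, 0, 0, 1, 1, 0, 0])   # any attribute mispredicted
        malig_error = np.array([0, 1, 0, 0, 1, 0, 0, 0])  # malignancy mispredicted

        # Phi coefficient between the two indicators; values near 1 mean
        # attribute errors reliably flag malignancy errors.
        print(matthews_corrcoef(attr_error, malig_error))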
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Great, public code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall very good paper and shows a lot of promise for improving clinical applicability of deep neural networks.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Motivation is great! Explainable methods are sorely needed.
    2. Understudied area. Capsules showed a lot of promise but they haven’t made any significant breakthroughs in a long time. This work achieves SotA with a capsule network while maintaining and even improving explainability.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I stand by the strong accept rating. I would strongly fight to see this paper accepted. I think there is significant novelty presented in this work, the results are strong, the experimentation is thorough (and concerns about it are, I think, well addressed in the rebuttal), the motivation is something sorely needed (explainable AI for medical imaging), and the methodology is from an understudied area. All of these are the marks of a great paper, and I think the authors had an excellent rebuttal to the other reviewers' concerns, some of which were good to point out prior to the rebuttal, and some of which I agree with the authors are not fair (see the final comment in the rebuttal). The additional correlation factor added based on my comment was much appreciated; it provides a quantitative validation of the authors' qualitative observation and further value for explainability.



Review #2

  • Please describe the contribution of the paper
    • The authors propose a method that combines capsule networks, prototype learning and privileged information. Their model also provides case-based explanations.
    • Experiments are performed on the LIDC-IDRI dataset.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors combine capsule networks, prototype learning and privileged information, which leads to a more explainable method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors erroneously give the impression that their work is the first suggesting that inherently interpretable approaches are not necessarily underperforming models. The works of Cynthia Rudin already demonstrate that.
    • It is not clear from the paper where the novelty resides. From my understanding, it consists in the integration of privileged information. The clarity and organization of the paper are poor.
    • The evaluation is not done properly, since the authors take the state-of-the-art results from the original reports rather than following the exact same training conditions.
    • The authors fail to explain the results obtained in Table 2. For instance, the performance of Sub improves with fewer attribute labels available.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Experiments were done with a public dataset.
    • Code will be made available.
    • Reproducibility is guaranteed.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors should structure the paper better, so that it is clear to the reader what the exact contributions of the authors are and what is already available in the literature.
    • The experimental settings should be refined, with all methods trained and evaluated under the same setting and optimized on the same hardware.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Poor organization and clarity.
    • Experiments are not done in the proper way, not evaluating methods trained in the same conditions.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors clarify the novelty of the approach. They also provide additional details regarding the experimental setting, increasing the trust in the results.



Review #3

  • Please describe the contribution of the paper

    The work proposes a new interpretable model, Proto-Caps, that combines prototypes with attribute learning. The evaluation was conducted on predictive task performance and showed that the model outperforms other SOTA models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    • The proposed explanation method is novel, provides more clinically relevant information in the explanation, and the model has good predictive performance
    • Conducted thorough ablation and data reduction studies
    • Code will be made available
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    • The model architecture description is unclear regarding how the prototype vectors are translated into the input space and visualized.
    • The explanation presented in Fig 2 is hard to interpret and needs improvement.
    • The most concerning weakness is that the evaluation was conducted only on the model’s predictive performance. Given that the main novelty of this work is the proposal of a new explanation model, there is a lack of any evaluation of the goodness of the explanation, either qualitatively, as assessed by clinical users, and/or quantitatively and computationally, assessing (1) the faithfulness of the explanation and (2) how well the explanation helps reveal the correctness of the model’s decision. Without any evaluation of the explanation, it is difficult to assess the validity and clinical applicability of the proposed explanation method.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that code will be made available

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Details of the weaknesses:

    The model architecture description is unclear. For readers who are not familiar with capsule models, it is unclear what exactly the capsule layer architecture is. The authors should give some description of how the capsule layer differs from the convolutional layer, and of the advantages of using capsules compared to the conventionally used conv layer. Moreover, from the method section, it is unknown how the prototypes are visualized as images in Fig 2. Are the images in Fig 2 taken from the whole training image data, or from an image region/patch of a training image? How does the capsule or prototype vector translate back to the input space?
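    For readers unfamiliar with capsules, a generic, minimal sketch (our PyTorch illustration, not the authors' implementation; all names and sizes are placeholders) of the difference: a conv layer emits scalar feature maps, whereas a primary-capsule layer regroups the conv output into vectors whose length encodes entity presence and whose orientation encodes entity properties:

        import torch
        import torch.nn as nn

        def squash(v, dim=-1, eps=1e-8):
            # Capsule nonlinearity (Sabour et al., 2017): shrinks each vector's
            # length into [0, 1) while preserving its direction.
            n2 = (v ** 2).sum(dim=dim, keepdim=True)
            return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

        class PrimaryCapsules(nn.Module):
            # A conv layer whose output channels are regrouped into capsule
            # vectors: each spatial position yields num_caps vectors of size
            # cap_dim instead of num_caps * cap_dim independent scalars.
            def __init__(self, in_ch, num_caps=8, cap_dim=16, k=9):
                super().__init__()
                self.num_caps, self.cap_dim = num_caps, cap_dim
                self.conv = nn.Conv2d(in_ch, num_caps * cap_dim, kernel_size=k)

            def forward(self, x):
                u = self.conv(x)                                  # scalar feature maps
                b, _, h, w = u.shape
                u = u.view(b, self.num_caps, self.cap_dim, h, w)  # regroup into vectors
                return squash(u, dim=2)

    On the kernel-size question below: in most frameworks an integer kernel size such as 9 conventionally denotes a square 9x9 kernel, which is likely what "256 kernels of size 9" refers to.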

    In Section 2 Method, “a reconstruction head learns the original segmentation and is used for stability reasons”. I suggest the authors explain in detail what the stability reasons are. Is it to stabilize the training process, to improve the consistency of the predictions, or something else?

    In the backbone description, “containing 256 kernels of size 9”, conventional kernels are 2D, so why is their size given as a single number, 9? The same question applies to the follow-up description of the kernel size of the capsule layer.

    In the loss function Eq. 4, please explain how the weights of each loss item were determined.

    In Section 4 Results, “The predicted malignancy score is justified by the closest prototypical sample of a certain attribute.” Was the justification conducted only at inference time? How was the justification performed? Is it a differentiable process that is also used at training time?

    Fig 2: the explanation is hard to interpret and needs improvement. Do the ground-truth and prediction numbers for malignancy refer to the same label scale as the gt and pred numbers for the attributes? Because the gt and pred labels are unclear, it is difficult to see from the explanation how “these discrepancies between the prototypes and the sample nodule raise suspicion, and help to assess the malignancy prediction”. I suggest the authors replace the gt and pred numbers with real labels to improve readability. In addition, the explanation process is also unclear. Is the image shown for each attribute the most similar prototype? If so, please indicate the similarity score. Since, in the model architecture, the predictions of attributes and malignancy are two parallel branches, how does the prediction of attributes in the explanation indicate a causal process leading to the prediction of malignancy?

    The results lack any evaluation of the explanation. The explanation consists of two parts: the attribute label and the corresponding prototype image. First, it is unknown how well the attribute label can explain the model prediction. From the model architecture, it looks like the attribute prediction and the malignancy prediction are on two independent branches, similar to the multitask learning paradigm. To establish the explanatory linkage between the attribute/prototype explanation and the malignancy prediction, the authors should evaluate the faithfulness of the explanation to the prediction. See “Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?” (https://aclanthology.org/2020.acl-main.386/) for faithfulness evaluation. Second, since the authors claim that “The misclassification is explained by the false attribute predictions”, they should conduct an evaluation to establish the validity of this claim. The evaluation can be conducted qualitatively, through a user study with clinical users, or quantitatively, by measuring the correlation between the plausibility of the explanation and the correctness of the malignancy prediction.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given that the main novelty of this work is the proposal of a new explanation model, but there is a lack of any evaluations on the goodness of explanation, it is difficult to assess the validity of the proposed novel explainability method.

    The reported evaluation covers only the model’s predictive performance, not the explainability performance. The faithfulness evaluation is important to establish that the generated explanation can truly explain the model by being faithful to its decision process. Otherwise, the generated “explanation” is just the result of the multitask learning process and has nothing to do with “explanation”.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I read the rebuttal and acknowledge that the authors properly revised their conclusion and contribution statement to reflect their work. But I am not clear why the two newly added chest X-ray datasets focus on only one class. Were these classes selectively presented or not? If I had this information, I would be more confident in raising my score, but since it is missing, I will keep my original score.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths

    • The proposed method effectively combines prototypical learning, capsule networks, and privileged information for lung cancer diagnosis
    • The proposed explanation method looks novel and provides clinically relevant information
    • The model’s performance is not decreased
    • The motivation is solid

    Weaknesses

    • A quantitative measure of interpretability is missing (clinical user evaluation or faithfulness evaluation of the explanation)
    • The experimental setting does not look fair because the training conditions differ
    • The clarity and organization of the paper can be improved
    • Some details (e.g., model architecture, visualization method, and weights in the loss function) are missing

    The method looks novel and interesting. However, two reviewers have concerns about the experiments and the clarity of the paper. A quantitative evaluation of the explanation is missing, which makes it difficult to assess the validity and clinical applicability of the method. I would recommend that the authors address the reviewers' concerns in the rebuttal.




Author Feedback

We thank all reviewers for their valuable comments and appreciate the consensus regarding the explainability of our method (R1, R2, R3), the novelty and motivation (R1, R3), the good predictive performance (R1, R3), and the thorough ablation study (R1, R3).

The main novelty of our work is the methodological fusion of complementary explainability approaches: attribute-based capsule networks from [1] (LaLonde et al., MICCAI 2020) are extended with visualizations of prototypes using new prototypical capsules and other architectural improvements. This leads to additional explanations and better results compared to [1].

Combining multiple complex techniques is always challenging and forces trade-offs between space limitations and the level of detail used to describe concepts. This might be the reason for the mixed reviews on clarity and organization, which we take seriously. Thank you for pointing this out and naming areas to improve. We will have uninvolved readers check the paper, revise it carefully, and add clarifications, including: (1) For each attribute prototype, the respective original image is saved and used for visualization (as in Fig. 2). (2) Building on [1], we also used the segmentation head they showed to be beneficial. (3) We clarify that we add to the already existing argument for high-performance explainability by strengthening the link to the citation of Rudin et al. (ref. [13] in the paper). (4) Capsule networks are used to strengthen the connection between attribute prediction and target prediction, as each capsule represents one attribute and all capsule vectors are used for predicting the target (as in [1]).
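To make point (4) concrete, the coupling could look roughly like the following sketch (our schematic illustration, not the released code; all names, dimensions, and layer choices are placeholder assumptions):

    import torch
    import torch.nn as nn

    class AttributeCapsuleHeads(nn.Module):
        # Schematic only: one capsule vector per radiologist-defined attribute.
        # Each vector scores its own attribute, while the concatenation of all
        # capsule vectors drives the malignancy prediction, so the two tasks
        # share a representation rather than running as independent branches.
        def __init__(self, num_attrs=8, cap_dim=16, attr_classes=5, malig_classes=5):
            super().__init__()
            self.attr_heads = nn.ModuleList(
                [nn.Linear(cap_dim, attr_classes) for _ in range(num_attrs)]
            )
            self.malig_head = nn.Linear(num_attrs * cap_dim, malig_classes)

        def forward(self, caps):  # caps: (batch, num_attrs, cap_dim)
            attrs = torch.stack(
                [head(caps[:, i]) for i, head in enumerate(self.attr_heads)], dim=1
            )
            malignancy = self.malig_head(caps.flatten(1))
            return attrs, malignancy

Under point (1), prototype visualization then amounts to storing, for each learned attribute prototype, the original training image of the sample closest to it, and displaying that image as the explanation.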

We agree with R3 that validating the interpretability of XAI methods is important, but doing so in a scientifically sound way easily exceeds the scope of a methodological paper. A weak user study can lead to noisy or even wrong conclusions if the number of participants is too small or if other effects such as confirmation bias are not accounted for. Consequently, methodological conference papers even at high-ranked venues omit such experiments (cf. Chen et al., NeurIPS 2019; LaLonde et al., MICCAI 2020; Kim et al., CVPR 2021). We decided to do the same, especially since we combine previously introduced paradigms, namely attribute-based explanations and prototypes. Nevertheless, we considered interpretability during development and ran experiments on it. An internal user study on the helpfulness of the additional prototype visualizations showed an increase from 1.8 to 3.3 (1: not trustworthy, 5: very trustworthy). For the reasons mentioned above, we chose not to make these results public. Following R3 and the improvement suggestion from R1, we evaluated the objective faithfulness of our approach by examining the relationship between correctness in attribute prediction and correctness in target prediction using a logistic regression analysis. An ACC of 0.93/0.1 shows the strong relationship between the two, and we will add this to the paper.
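As a rough illustration, the described analysis could be set up as follows (a sketch under our own assumptions; the variable names and the stand-in data are hypothetical, and the reported accuracy comes from the authors' experiments, not from this code):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # attr_correct: (n_nodules, n_attributes) binary matrix, 1 where the
    # predicted attribute score matches the radiologist label (within some
    # tolerance); malig_correct: 1 where malignancy was predicted correctly.
    # Random stand-in data keeps the sketch self-contained.
    rng = np.random.default_rng(0)
    attr_correct = rng.integers(0, 2, size=(200, 8))
    malig_correct = (attr_correct.mean(axis=1) > 0.5).astype(int)

    # Cross-validated accuracy of predicting malignancy-correctness from
    # attribute-correctness; a high value means attribute errors are a
    # reliable warning sign for malignancy errors.
    acc = cross_val_score(LogisticRegression(), attr_correct, malig_correct,
                          cv=5, scoring="accuracy")
    print(acc.mean())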

We disagree that the SotA comparison is unfair (R2), but agree that the paper could give this impression. Using reported results is an accepted and common procedure (used in LaLonde et al., MICCAI 2020; Y. Xie et al., IEEE TMI 2019; Liu et al., MICCAI 2021; and many more). To ensure a fair experimental setting and methodological comparison, we use the same data, data preprocessing and splitting, model architecture, loss functions, and branch weights (apart from the mentioned additional novel branches), and the same evaluation metrics as the method we mainly compare to. Comparing with literature values also ensures that all approaches were properly implemented and optimized, which is not guaranteed otherwise. It can be difficult for the reader to extract these points from the paper, so we will add this information to the final version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper makes novel contributions. Although there were some concerns about the evaluation, the authors have successfully addressed them, and three reviewers agree with the acceptance of this paper. It would be great to update the final version as promised in the rebuttal.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors effectively addressed the reviewers' concerns on novelty and interpretability raised in the original reviews. The topic is very important, and this can be an interesting presentation at MICCAI.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper has significant novelty, thorough evaluation and strong results. The reviewer concerns were adequately addressed in the rebuttal. The code will be made public and the datasets used are public to ensure the reproducibility of the proposed approach.


