
Authors

Manuel Tran, Amal Lahiani, Yashin Dicente Cid, Melanie Boxberg, Peter Lienemann, Christian Matek, Sophia J. Wagner, Fabian J. Theis, Eldad Klaiman, Tingying Peng

Abstract

Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not surprising, as critical decisions need to be transparent and understandable. The most common approach to understanding transformers is to visualize their attention. However, attention maps of ViTs are often fragmented, leading to unsatisfactory explanations. Here, we introduce a novel architecture called the B-cos Vision Transformer (BvT) that is designed to be more interpretable. It replaces all linear transformations with the B-cos transform to promote weight-input alignment. In a blinded study, medical experts clearly ranked BvTs above ViTs, suggesting that our network is better at capturing biomedically relevant structures. This is also true for the B-cos Swin Transformer (Bwin). Compared to the Swin Transformer, it even improves the F1-score by up to 4.7% on two public datasets.
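For reference, the B-cos transform the abstract refers to can be sketched as follows. This is a minimal NumPy illustration based on the published description of B-cos networks (the paper's ref [7]); `b_cos_transform` and the parameter name `b` are illustrative, not the authors' code:

```python
import numpy as np

def b_cos_transform(x, w, b=2.0):
    """Minimal B-cos transform for a single unit.

    Computes (w_hat . x) * |cos(x, w)|**(b - 1), where w_hat is the
    unit-norm weight vector. For b = 1 this reduces to an ordinary
    linear transform; larger b increasingly rewards weight-input
    alignment, which is what makes the learned weights interpretable.
    """
    w_hat = w / np.linalg.norm(w)
    cos = (w_hat @ x) / (np.linalg.norm(x) + 1e-12)
    return (w_hat @ x) * np.abs(cos) ** (b - 1)

w = np.array([1.0, 0.0])
# Aligned input: response close to the plain linear response (~2.0).
aligned = b_cos_transform(np.array([2.0, 0.0]), w)
# Misaligned input: response is strongly suppressed (~0.005).
misaligned = b_cos_transform(np.array([0.1, 2.0]), w)
```

Replacing every linear layer in a ViT with such a unit (and dropping the nonlinearities it makes redundant) is the architectural change the paper proposes.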

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43993-3_50

SharedIt: https://rdcu.be/dnwNV

Link to the code repository

N/A

Link to the dataset(s)

https://zenodo.org/record/1214456

https://doi.org/10.7937/tcia.2019.36f5o9ld

https://portal.gdc.cancer.gov/projects/TCGA-COAD


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper proposes a straightforward idea: replacing the linear layers of ViT with the B-cos transform (and subsequently removing ReLU), in order to address known widespread issues with the interpretability of attention heatmaps. The authors train ViT/BvT models on several pathology-relevant datasets, compare performance, and have domain experts compare their ability to interpret the attention maps of each model. The authors find that BvT is more interpretable. They further derive a B-cos variant of the Swin transformer.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper addresses a real need in computer vision pathology applications. Interpretability is important not just in identifying failure modes, but in helping clinicians become comfortable with computer vision tools.
    • The authors collect actual quantitative data on interpretability, which is less common than simply making qualitative statements. Thus, their claims around interpretability are quantitatively justified via their evidence.
    • The authors test their model on multiple datasets and demonstrate consistency.
    • BvT may have applications outside of pathology as well.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The performance drop for the BvT model is concerning, as a tradeoff between clinical accuracy and interpretability would be a potential barrier to implementation in the clinic. It's not clear to me why the authors' proposed explanation would lead to a 5% performance gain when testing generalization.
    • Is the training paradigm sufficient? The use of only 20-30 epochs seems potentially low.

    There are minor statistical issues:

    • If rankings are ordinal rather than continuous, the violin plots in Fig 6 are likely unhelpful for visualizing the rankings vs. e.g. a confusion matrix. Same for S1.
    • Likewise, in S1, a Student's t-test is not the correct method of comparison between the methods.
    • Performance metrics should ideally be bootstrapped for a CI.
    • N for the comparisons in Fig 6 should be provided in the main text.
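To make the bootstrap suggestion above concrete, a percentile-bootstrap confidence interval for the F1-score might look like this. This is a sketch only; `bootstrap_f1_ci` is a hypothetical helper, not code from the paper, and it assumes sklearn-style label/prediction arrays:

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the macro F1-score.

    Resamples (label, prediction) pairs with replacement and reports
    the (alpha/2, 1 - alpha/2) percentiles of the resampled scores.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

For the ordinal ranking comparison, a rank-based test such as the Mann-Whitney U test (`scipy.stats.mannwhitneyu`) would be a standard replacement for the t-test.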
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficient information is provided

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see “weaknesses”. Further discussion of the performance differences is warranted, since these are non-trivial.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a clever and straightforward extension of existing methods with the potential to impact how clinical-grade AI pathology systems are designed. The results are clear and compelling.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the article is using B-cos transforms as a drop-in replacement for the linear transforms in the ViT to obtain more interpretable features in histopathological tasks. The authors advocate that the proposed BvT can be used as a more explainable alternative to ViT, and thereby add transparency and interpretability to clinical decisions. The authors also discovered that the proposed BvT outperforms the original models in certain classification tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work proposes an interesting adaptation of Transformers to gain better explainability, building on the alignment ideas of ref [7], as Fig. 6 shows.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Only patch classification was evaluated as a downstream task, which is not convincing enough. Popular tasks such as semantic segmentation, object detection, or slide-level classification could be further evaluated to show that the proposed architecture is generally beneficial for feature extraction. Details on the example tasks for the three public datasets are not provided. The authors mention that Transformer-based architectures have various applications in histopathological tasks such as classification, segmentation, survival prediction, and mutation detection, and that model interpretability plays an important role; hence the example datasets and tasks should be representative. The authors may need to elaborate on their dataset and task selection.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not clear to me.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In the results and discussion, the authors attribute the poor training performance of the BvT model to the simultaneous optimization of two objectives. This could be further investigated by adjusting the relative weights of these two objectives.

    The authors may need to further explain and analyze the results presented in Figure 6.

    Could you remind the reader of the rationale for adding [1 − r, 1 − g, 1 − b] to the RGB channels [r, g, b] to gain alignment and interpretability (ref 7), as it appears important? Also, why the resolution of 0.5 mpp?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work is an interesting extension of ViT and the alignment ideas of ref 7. The assessment would benefit from being more fair and detailed.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper introduces a novel vision transformer variant - the B-Cos transformer (BvT) in an effort to improve the explainability of vision transformer models in computational pathology. The authors present their motivation for this model, and apply it to 3 independent WSI datasets. The authors confirm through a robust study of BvT’s attention maps with 2 domain experts that the attention maps of BvT are preferred to the attention maps of ViTs since they highlight diagnostically relevant areas of the tissue.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is an extremely well written paper with clear and concise methodological concepts presented throughout. The main contribution of this paper is the replacement of the linear transformations in ViT with the B-cos transform. At face value, this seems like an arbitrary change; however, the authors justify their reasoning for making this change well. Since this is a paper about explainability, the evaluation of explainability should be robust and carried out by domain experts, which is the case here. The methodology for evaluating attention maps is of great value to the community. This paper adds to the growing body of literature concerning problems with transformer attention as an interpretability method and presents a step change in the right direction. The authors are honest about the limitations of their method, stating that training from scratch results in reduced performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Since this paper is about improving the interpretability of transformer attention, I am not sure if the comparison between the authors' method and other interpretability methods is necessary. I would prefer to see a more in-depth comparison of the attention of BvT and ViT (an extended version of Figure 1, perhaps removing Figure 3). Are the attention heads compared in Figure 1 comparable? How do we know that head 1 in ViT and head 1 in BvT learnt to attend to similar tissue structures and can therefore be compared? How many heads do ViT and BvT have in this study? Can the authors make it clearer how the community should react to this study? For example, the authors state that training from scratch leads to worse performance. Is the recommended action to do transfer learning only? Do the authors recommend that we sacrifice performance for better explainability? There are methods, like HIPT, that have shown that pure ViTs are capable of producing biologically relevant attention maps; how do the authors respond to this? Does BvT only produce better attention maps in some cases? When the domain experts were provided with attention maps, which head(s) were shown, and how were they chosen? This is an important detail.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good, if the code is made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Since this paper is about improving the interpretability of transformer attention, I am not sure if the comparison between the authors' method and other interpretability methods is necessary. I would prefer to see a more in-depth comparison of the attention of BvT and ViT (an extended version of Figure 1, perhaps removing Figure 3). Are the attention heads compared in Figure 1 comparable? How do we know that head 1 in ViT and head 1 in BvT learnt to attend to similar tissue structures and can therefore be compared? How many heads do ViT and BvT have in this study? Can the authors make it clearer how the community should react to this study? For example, the authors state that training from scratch leads to worse performance. Is the recommended action to do transfer learning only? Do the authors recommend that we sacrifice performance for better explainability? There are methods, like HIPT, that have shown that pure ViTs are capable of producing biologically relevant attention maps; how do the authors respond to this? Does BvT only produce better attention maps in some cases? When the domain experts were provided with attention maps, which head(s) were shown, and how were they chosen? This is an important detail.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper adds to the growing body of literature concerning problems with transformer attention as an interpretability method and presents a step change in the right direction

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    To enhance this study, we offer the following suggestions regarding its strengths and weaknesses:

    Summary of Key Strengths:

    • The paper is exceptionally well-written, presenting clear and concise methodological concepts throughout.
    • The proposed adaptation of Transformers for improved explainability is intriguing.
    • The authors provide proper justification for replacing linear transformations in ViT with B-cos transform.
    • The evaluation of explainability is robust and conducted by domain experts.
    • The methodology for evaluating attention maps is highly valuable, addressing a real need in computer vision pathology applications.
    • The paper explores transformer attention as an interpretability method.
    • The authors support their claims on interpretability quantitatively by collecting actual quantitative data, which is less common in similar studies.
    • The model is tested on multiple datasets, demonstrating consistency.
    • BvT may have applications beyond pathology, broadening its potential impact.

    Main Weaknesses:

    • A more in-depth comparison between the attention of BvT and ViT is needed.
    • The authors should address the existence of methods like HIPT, which have shown that pure ViTs can produce biologically relevant attention maps. How does BvT respond to this? Does BvT only produce better attention maps in certain cases?
    • Clarify which attention heads were shown to domain experts and how they were selected.
    • Further evaluation beyond patch classification, such as semantic segmentation, object detection, or slide-level classification, would enhance the convincing power of the proposed architecture in terms of feature extraction.
    • Provide additional details about dataset and task selection to ensure representativeness.
    • The observed performance drop in the BvT model raises concerns, as a trade-off between clinical accuracy and interpretability could hinder its implementation in clinical settings.
    • It is not clear why the proposed explanation would lead to a 5% performance gain during generalisation testing.
    • Assess whether the training paradigm is sufficient, as the use of only 20-30 epochs seems potentially low.
    • Address any minor statistical issues raised by reviewer #3.

    To address these concerns and improve the study, we suggest the following recommendations:

    • Conduct an in-depth comparison of attention between BvT and ViT.
    • Clarify the comparability of attention heads in Figure 1, such as whether head 1 in ViT and head 1 in BvT attend to similar tissue structures. Also, specify the number of heads used in ViT and BvT in this study.
    • Elaborate on the recommendation of transfer learning as the preferred approach and clarify whether sacrificing performance for better explainability is necessary.
    • Discuss how BvT compares to methods like HIPT, which have shown that pure ViTs can produce biologically relevant attention maps. Explain cases where BvT produces superior attention maps.
    • Provide details on which attention heads were shown to domain experts and describe the selection process.
    • Further investigate the simultaneous optimisation of the two objectives as a potential cause for the poor training performance of the BvT model, including adjusting optimisation weights.
    • Provide additional explanation and analysis of the results presented in Figure 6.
    • Reinforce the rationale behind adding [1 − r, 1 − g, 1 − b] to the RGB channels [r, g, b] for alignment and interpretability, referring to reference 7. Similarly, explain the significance of the resolution of 0.5 mpp.
    • Include a more detailed discussion on the performance differences, as they hold substantial importance.




Author Feedback

Dear Area Chair and Reviewers,

We greatly appreciate your time and effort in reviewing our manuscript. Your insightful comments and constructive feedback have not only confirmed the quality of our work, but will help us sharpen our arguments and make our contributions even more compelling. It is truly rewarding to see that all the reviewers reached a consensus and recognized the high standards of clarity and conciseness of our paper, especially in the presentation of the methodological concepts. We are also pleased that the Area Chair has provisionally accepted our manuscript and that all reviewers have voted in its favor. Below, we would like to address some open questions.

Attention maps. R1 suggested the valid idea of adding an in-depth comparison between the attention maps of ViT and BvT and possibly removing Fig. 3. We believe that Fig. 3 provides further insight into our model, as the results for all visualization tools illustrate the high interpretability of our architecture. This reassures us that our model is inherently interpretable and does not just work for the attention maps. The attention maps we show in the paper are extracted from the T/8 models, which use three attention heads in each transformer layer. We always show all attention heads for each model to domain experts. They are compared as they are. This is important because it allows us to see what features the models take into account and provides a deeper look into the different decision processes.

Transfer learning. We thank R3 for highlighting this topic and giving us the opportunity to explain the performance gap between ViT and BvT. When we applied transfer learning and reused the weights from our experiments on NCT for TCGA, we saw a large increase in performance for BvT compared to training from scratch. In this setting, BvT even outperformed ViT. We observed a similar trend in our Bwin experiments. This strongly suggests that optimizing the B-cos transform requires balancing two goals: learning the inductive biases to solve the classification task, and increasing the alignment of the weights and inputs. When BvT is trained from scratch, it must balance these two tasks, potentially sacrificing performance. But once the inductive biases are learned from the data, BvT can use its more biomedically plausible features to outperform ViT. Bwin already comes with a hand-crafted inductive bias from the window attention. Thus, Bwin does not need to learn the inductive biases first and can outperform Swin even when trained from scratch.

Additional channels. We would like to comment on the rationale for adding [1 - r, 1 - g, 1 - b] to the original RGB channels, as suggested by R2. As described in Ref [7], this allows the B-cos transform to capture the directional relationships between the color values. This is especially important when these colors are used specifically for interpretability. We did not analyze color information, but we still kept the inverse channels in BvT for consistency with Ref [7]. In addition to color channels, we also want to address the topic of input resolutions. This is particularly important since medical specimens are scanned at different scales. Therefore, we have chosen datasets at 20x and 100x. As shown in Fig. 1, BvT is able to handle different input magnifications.
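The six-channel encoding discussed above can be sketched as follows. This is an illustrative NumPy sketch of the idea from Ref [7]; `add_inverse_channels` is a hypothetical name, and inputs are assumed to be scaled to [0, 1]:

```python
import numpy as np

def add_inverse_channels(img):
    """Encode an RGB image (H, W, 3) with values in [0, 1] as the
    6-channel tensor [r, g, b, 1-r, 1-g, 1-b] used by B-cos networks.

    Every pixel then has the same channel sum (r + (1-r) + ... = 3),
    so color is carried purely by the *direction* of the pixel vector,
    which the alignment-seeking B-cos weights can latch onto.
    """
    return np.concatenate([img, 1.0 - img], axis=-1)

x = np.random.rand(4, 4, 3)
x6 = add_inverse_channels(x)
```

Because the per-pixel norm is fixed, two colors that differ only in brightness map to distinguishable directions, which is what makes the color information recoverable from the model weights.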

HIPT. The AC's reference to the HIPT architecture is a welcome opportunity for us to compare our approach with other models. Analyzing the attention maps in the HIPT paper, we observe that its attention heads can separate tumor, stroma, and background. However, according to the visualization shown, HIPT is not able to further highlight finer substructures within the same tissue. In contrast, BvT does detect different components of the tumor tissue (see Fig. 1).

We once again express our gratitude for the opportunity to address the primary concerns of the reviewers. We eagerly await the MICCAI community’s application of our proposed methodology.


