
Authors

Saisai Ding, Jun Wang, Juncheng Li, Jun Shi

Abstract

Whole slide image (WSI) classification is an essential task in computational pathology. Despite the recent advances in multiple instance learning (MIL) for WSI classification, accurate classification of WSIs remains challenging due to the extreme imbalance between the positive and negative instances in bags, and the complicated pre-processing to fuse multi-scale information of WSI. To this end, we propose a novel multi-scale prototypical Transformer (MSPT) for WSI classification, which includes a prototypical Transformer (PT) module and a multi-scale feature fusion module (MFFM). The PT is developed to reduce redundant instances in bags by integrating prototypical learning into the Transformer architecture. It substitutes all instances with cluster prototypes, which are then re-calibrated through the self-attention mechanism of the Transformer. Thereafter, an MFFM is proposed to fuse the clustered prototypes of different scales, which employs MLP-Mixer to enhance the information communication between prototypes. The experimental results on two public WSI datasets demonstrate that the proposed MSPT outperforms all the compared algorithms, suggesting its potential applications.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_58

SharedIt: https://rdcu.be/dnwKd

Link to the code repository

N/A

Link to the dataset(s)

TCGA https://portal.gdc.cancer.gov

CAMELYON16 https://camelyon17.grand-challenge.org


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a way to create prototypical instances representative of bags to make MIL training efficient, and additionally proposes a fusion module to aggregate information across scales. The method is compared against other methods on two publicly available datasets and shows superior performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed formulation for generating prototypical instances, using attention to improve the prototypes generated by clustering, is novel and interesting. Additionally, using MLP-Mixer to aggregate prototypes from multiple resolutions is also novel and insightful. Experiments are comprehensive, and ablations show the usefulness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I am a little concerned about the overhead of the method; clustering-based prototype generation has significant compute overhead. If prototype learning is primarily useful for handling large bags, it should be compared against methods like Nystrom Attention (used in TransMIL), which can reduce the memory and compute overhead. Although the authors did compare against TransMIL, it doesn't have multi-res pooling, so the comparison doesn't address this issue.

    The motivation behind preferring an MLP-Mixer over attention-based fusion is not discussed. The ablations show it's better, but it's unclear what type of attention pooling was used there.

    The authors compare against ReMix, which is a single-resolution prototype-based method. It's unclear whether the improvements over ReMix are due to PT or the multi-res features. Can results be added with just the PT on a single resolution?

    Some of the details aren't clear. Is the encoder frozen during training? If it isn't, how is backprop handled through k-means?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors promised source code and pre-trained models

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors should discuss the memory vs. training overhead of their method compared with other Transformer-based MIL methods. Considering the many improvements to attention-based methods, like Nystrom attention (used in TransMIL), flash attention, etc., it would be valuable to understand how prototypical learning compares to them.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although there are concerns over the practicality of prototype-based MIL approaches, the ideas in the paper for improving prototype learning and fusing multi-res features are interesting.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel prototypical Transformer with a multi-scale mechanism to learn superior prototype representations for WSI classification. Unlike previous methods that use k-means for fixed prototypical feature generation, they propose to recalibrate the k-means prototypical features using the self-attention mechanism. Moreover, self-attention between the prototypical features and the bag is relatively computationally efficient compared to self-attention over all instances in a bag. The authors validated their method by comparing with SOTA methods and conducting ablation studies on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes an efficient and effective patch aggregation method for multiple instance learning on pathology images. Experiments showed that the proposed method outperformed SOTA methods (k-means, multi-scale, etc.) on two datasets.
    2. The influence of K is shown in the result plot, and different aggregation methods are compared.
    3. Generally, the paper is well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. When comparing with other SOTA methods (Table 1), was the multi-resolution (5×/10×/20×) mechanism applied to all other methods for a fair comparison? This should be clarified in detail.
    2. What is the meaning of T in Figure 2?
    3. A small typo in the abstract: multi-scale feature fusion module (MFFM) module.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The main idea should be easy to implement and the datasets are public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Are the shared MLP layers used for different prototypes? Figure 2 is a little confusing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The aggregation of patches is a key problem in WSI pathology image analysis. The authors propose an efficient and effective method, which outperforms other SOTA methods on two public datasets.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper aims to use a combination of prototype learning and Transformer models to address WSI classification tasks. The authors introduce a prototypical Transformer component that can cluster multiple instances, and a multi-scale feature fusion module based on MLP-Mixer to combine features from different magnifications. The effectiveness of these methods is verified through experiments conducted on two datasets, CAMELYON16 and TCGA-NSCLC.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1, The paper is written in a clear and easy-to-follow manner, making it easy to understand the proposed method.

    2, The motivation behind using Transformer-based prototype learning for clustering instances is sound, and the use of MLP-Mixer to fuse features from different magnifications is a reasonable approach for WSI classification.

    3, The use of embeddings from clustering as queries can significantly reduce computational costs for the Transformer-based module, making the approach more efficient.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1, The format of the paper could be improved to meet the submission guidelines. For instance, the font and equation formatting should match the template, and there should be consistent spacing between the main content and sub-chapters.

    2, The paper’s experiments on ablation study are limited, which may make it challenging to verify the effectiveness of the proposed methods.

    3, The authors should consider including some visualization results to help demonstrate the effectiveness of the PT and MFFM methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, the authors provide experimental settings that are detailed and can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1, The paper’s format, including the font and equation formatting, could benefit from improvements to ensure clarity and ease of reading.

    2, The authors should consider including proper capitalization and formatting throughout the paper to ensure consistency and readability, such as using “Whole Slide Image” instead of “Whole slide image”.

    3, The equations used in the paper should be clearly explained, including the meaning of variables such as $d_k$ and how to obtain $P_{bag}$.

    4, Figure 2 could be improved by including clearer descriptions of the input, output, and shape of features.

    5, Equations 3-6 could potentially be combined to improve clarity, and more detailed information about features from different magnifications should be included.

    6, It would be helpful if the authors could explain how to optimize K-means clustering based on the Transformer-based architecture, as well as conduct experiments to explore computational costs and the impact of different numbers of clusters for different magnifications.

    7, For the MFFM, the authors could explore different combinations of magnifications, such as 5x+10x, 10x+20x, or using only one magnification.

    8, Including visualization results for the PT, MFFM, and slide-level heatmap could enhance the clarity and impact of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1, The detailed information about the methods seems limited, such as how to optimize PT and the detailed pipeline of MFFM.

    2, More ablation studies on PT and MFFM should be included.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I’ve carefully checked the authors’ rebuttal and upgraded my score to weak accept. I hope the major and minor revisions will be made.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a novel approach for efficient MIL training in WSI classification by creating prototypical instances representative of bags and introducing a fusion module to aggregate information across scales. Experimental results demonstrate its superior performance compared to other methods on two publicly available datasets.

    Given the conflicting opinions among the reviewers, it is crucial to address the concerns raised by Reviewer 3 in the rebuttal. The reviewers emphasize the need for more comprehensive ablation studies to validate the effectiveness of the proposed methods. It is strongly recommended to include visualization results for the prototypical Transformer (PT), multi-scale feature fusion module (MFFM), and slide-level heatmap to enhance clarity and impact. Additionally, optimizing K-means clustering based on the Transformer-based architecture and exploring different combinations of magnifications in the MFFM should be considered. Lastly, to ensure a fair comparison, it should be clarified whether the multi-resolution mechanism was applied to the other state-of-the-art methods in the comparison.




Author Feedback

Summary: Thanks to all the reviewers for acknowledging our methodological contribution. We are pleased that they find this work novel and insightful (R1), efficient and effective (R2), and well-motivated with clear objectives (R3). The following is our point-by-point response to reviewers’ comments.

Q1 (R1, R3, Meta-R): “The computation overhead of PT.” A1: The traditional MIL model is not an end-to-end network; it contains multiple stages, and the patch-level feature extraction (denoted as X_bag) and cluster-based prototype generation (denoted as P_bag) belong to the pre-processing stage. Therefore, their training budgets are not considered in the training of PT. We compared the training budgets of PT and TransMIL, i.e., the average training time per epoch (5s vs. 12s) and the peak memory consumed (2.37G vs. 2.39G) during training. On the Came16 dataset, PT improves the training speed at similar memory consumption to TransMIL; we will include this in the final version.

Q2 (R1, R3, Meta-R): “How to optimize K-means clustering using PT, and the impact of different numbers of clusters for different magnifications.” A2: The optimization process can be divided into two steps: 1) the initial cluster prototype bag P_bag is obtained in the pre-processing stage by applying k-means clustering to X_bag; 2) PT uses X_bag to optimize P_bag via the self-attention mechanism in the Transformer, as described in Eq. 2. Please note that, to construct the multi-resolution feature pyramid, the number of cluster prototypes must be the same at different magnifications. Therefore, we only performed experiments at 20× resolution in Fig. 2 to determine the optimal number of prototypes, since the 20× features play a decisive role in classification.
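The two-step scheme in A2 can be sketched in a few lines of numpy (a minimal illustration under assumed shapes; the plain k-means loop, the dimensions, and all helper names are our own assumptions, not the authors' implementation):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    # Step 1 (pre-processing): plain k-means builds the initial prototype
    # bag P_bag from the instance features X_bag.
    rng = np.random.default_rng(seed)
    P = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - P[None, :, :]) ** 2).sum(-1)  # (N, k) distances
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                P[j] = X[labels == j].mean(0)
    return P

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def recalibrate(P_bag, X_bag):
    # Step 2: prototypes act as queries and all instances as keys/values,
    # so the attention cost is O(K*N) rather than O(N^2) over instances.
    d_k = P_bag.shape[-1]
    A = softmax(P_bag @ X_bag.T / np.sqrt(d_k))  # (K, N) attention weights
    return A @ X_bag                             # (K, d) recalibrated prototypes

X_bag = np.random.default_rng(1).normal(size=(500, 64))  # 500 instances, 64-d
P_bag = kmeans(X_bag, k=8)
P_hat = recalibrate(P_bag, X_bag)
assert P_hat.shape == (8, 64)
```

In the actual PT, the query/key/value projections are learned and the loss backpropagates through the attention weights into these projections, which is why the frozen pre-computed P_bag can still be "optimized" without differentiating through k-means itself.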

Q3 (R1, R2, Meta-R) “When comparing with other SOTA methods (Table 1), was the multi-resolution mechanism applied on all other methods for a fair comparison?” A3: Since the other SOTA comparison methods were originally designed for a single resolution, they were only implemented at 20× resolution. In fact, we have extended Max-pooling and ABMIL to multi-resolution methods in Table 2; the results indicated that our MSPT still achieved the best performance. For a fair comparison, our single-resolution model PT is added to Table 1 and outperforms ReMix on most metrics.

Q4 (R3, Meta-R) “Lack of Visualization results.” A4: We visualized the attention scores from PT as a slide-level heatmap to determine the ROIs of WSI and provided high-attention patches of different resolutions for interpretability. We will include it in the Supplementary Materials.

Q5 (R2, R3, Meta-R) “Exploring different combinations of magnifications in the MFFM.” A5: We have studied the impact of different multiscale schemes (i.e., 5×+10×, 5×+20×, and 5×+10×+20×) for the MFFM, and the “5×+10×+20×” scheme obtains the best performance. It will be included in the Supplementary Materials.

Q6 (R2, R3) “The detailed pipeline about MFFM and Fig. 2 is a little confusing.” A6: MFFM consists of an MLP-Mixer and a Global Attention Pooling (GAP). The MLP-Mixer is used to enhance the information communication among the prototype representations, and the GAP is used to obtain the WSI-level representation for WSI classification. We adjusted Eq. 3-6 to detail the feature processing in MFFM and redesigned Fig. 2 to clarify the overall flow of MFFM.
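The MFFM pipeline described in A6 (token and channel mixing followed by Global Attention Pooling) can be sketched as follows. This is a hedged numpy illustration: all layer widths, initializations, and function names are hypothetical, and the paper's actual MLP-Mixer details (normalization, depth) are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 4, 32          # K prototypes per scale, d-dim features (assumed sizes)
n = 3 * K             # prototypes from 5x, 10x, 20x stacked along the token axis

def mixer_layer(T, U1, U2, W1, W2):
    # Token-mixing MLP: acts across the prototype (token) axis, letting
    # prototypes from different magnifications exchange information.
    T = T + U2 @ np.tanh(U1 @ T)
    # Channel-mixing MLP: acts per prototype across feature channels.
    T = T + np.tanh(T @ W1) @ W2
    return T

def gap(T, V, w):
    # ABMIL-style gated attention pooling: weight each prototype and
    # collapse the token axis into one WSI-level vector.
    a = np.exp(np.tanh(T @ V) @ w)
    a = a / a.sum()
    return a @ T, a

T = rng.normal(size=(n, d))                              # stacked prototypes
U1, U2 = rng.normal(size=(16, n)) * 0.1, rng.normal(size=(n, 16)) * 0.1
W1, W2 = rng.normal(size=(d, 64)) * 0.1, rng.normal(size=(64, d)) * 0.1
V, w = rng.normal(size=(d, 16)) * 0.1, rng.normal(size=16) * 0.1

z, a = gap(mixer_layer(T, U1, U2, W1, W2), V, w)
assert z.shape == (d,) and np.isclose(a.sum(), 1.0)
```

The token-mixing step is what distinguishes this from the per-scale attention pooling described in A7: every prototype can influence every other prototype, regardless of magnification, before the final pooling.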

Q7 (R1) “The motivation behind preferring an MLP-Mixer over attention-based fusion is not discussed.” A7: MS-Attention uses the attention pooling of ABMIL on the cluster prototypes of each scale to get a per-scale prediction and then sums all predictions to obtain the final classification result. Therefore, the interaction between prototypes of different magnifications is limited. In contrast, the MLP-Mixer in our proposed MFFM allows information communication among different prototypes and prototype features to learn superior representations for classification.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The concerns raised by the reviewers have been largely addressed in the rebuttal, resulting in R3 revising the score positively. I believe that if the points presented in the rebuttal are incorporated into the manuscript, it would be appropriate to accept the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a Transformer-based architecture that integrates multi-scale information and prototype learning for the task of whole slide image classification. Key strengths include the combination of ideas in this paper, the motivation for the different components, and the evaluation on two public datasets with superior performance. In the initial review, the limited extent of the ablation experiments, missing visualizations, and missing information on the clustering were criticized. In their rebuttal, the authors addressed the clarification questions satisfactorily; however, the authors also promised a number of added results compared to the submitted version of the paper. This is not ideal, as the review and rebuttal process at MICCAI is not designed for this: added results cannot be checked with the same scrutiny as the original submission.

    After the rebuttal, the paper has three weak accepts. Given the support for the general motivation of the different components and small improvements compared to existing approaches, this results in a borderline accept from my perspective.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes MIL training for WSI classification using prototypical instances representative of bags while also aggregating information across multiple scales. Reviewers emphasize the motivation of the network design but have a few concerns, e.g., the computational overhead and the comparison to other multi-scale approaches. The author rebuttal addressed these issues well and persuaded the initially negative reviewer to raise his/her score. Therefore, I suggest accepting the paper.


