
Authors

Jingwei Zhang, Saarthak Kapse, Ke Ma, Prateek Prasanna, Joel Saltz, Maria Vakalopoulou, Dimitris Samaras

Abstract

Whole slide image (WSI) classification is a critical task in computational pathology, requiring the processing of gigapixel-sized images, which is challenging for current deep-learning methods. Current state-of-the-art methods are based on multi-instance learning (MIL) schemes, which usually rely on pretrained features to represent the instances. Due to the lack of task-specific annotated data, these features are either obtained from well-established backbones on natural images or, more recently, from self-supervised models pretrained on histopathology. However, both approaches yield task-agnostic features, resulting in a performance loss compared to appropriate task-related supervision, if available. In this paper, we show that when task-specific annotations are limited, we can inject such supervision into downstream task training to reduce the gap between fully task-tuned and task-agnostic features. We propose Prompt-MIL, an MIL framework that integrates prompts into WSI classification. Prompt-MIL adopts a prompt-tuning mechanism, where only a small fraction of parameters calibrates the pretrained features to encode task-specific information, rather than the conventional full fine-tuning approaches. Extensive experiments on three WSI datasets, TCGA-BRCA, TCGA-CRC, and BRIGHT, demonstrate the superiority of Prompt-MIL over conventional MIL methods, achieving a relative improvement of 1.49%-4.03% in accuracy and 0.25%-8.97% in AUROC while using fewer than 0.3% additional parameters. Compared to conventional full fine-tuning approaches, we fine-tune less than 1.3% of the parameters, yet achieve a relative improvement of 1.29%-13.61% in accuracy and 3.22%-27.18% in AUROC, reduce GPU memory consumption by 38%-45%, and train 21%-27% faster.
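To make the prompt-tuning mechanism concrete, below is a minimal PyTorch sketch of prepending trainable prompt tokens to a frozen transformer encoder's token sequence. All names and dimensions here (PromptedViT, embed_dim=384, k=1) are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class PromptedViT(nn.Module):
        """Illustrative prompt tuning: freeze a pretrained encoder and learn
        only k prompt tokens prepended to the patch-token sequence."""

        def __init__(self, encoder: nn.Module, embed_dim: int = 384, k: int = 1):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():  # pretrained weights stay frozen
                p.requires_grad = False
            # The only new trainable parameters: k prompt tokens.
            self.prompt = nn.Parameter(torch.randn(1, k, embed_dim) * 0.02)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, n_patches, embed_dim) patch embeddings
            prompts = self.prompt.expand(tokens.size(0), -1, -1)
            return self.encoder(torch.cat([prompts, tokens], dim=1))

With a batch-first stand-in such as nn.TransformerEncoder(nn.TransformerEncoderLayer(384, 6, batch_first=True), num_layers=2) as the encoder, this module runs as written, and only self.prompt receives gradients.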

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43993-3_60

SharedIt: https://rdcu.be/dnwN5

Link to the code repository

https://github.com/cvlab-stonybrook/PromptMIL

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    In this paper, the authors introduce visual prompt tuning to end-to-end MIL classification. Results show significant improvements over the full fine-tuning approach across three datasets. Additionally, the proposed method is faster and requires less GPU memory.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Introducing prompt tuning to end-to-end multi-instance learning is novel; it offers easier and faster training and lower memory requirements than full tuning.
    2. This paper is well-written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In the introduction, the authors claim that end-to-end training methods offer better performance than conventional MIL, and on this basis they attempt to improve end-to-end MIL methods. However, in their results the full-tuning approach yields a performance drop, which is inconsistent with that claim.
    2. I suspect this full-tuning training strategy is unreasonable. Since no details are provided about the full fine-tuning approach, it is uncertain whether the training settings were the same as for the prompt-tuning method. It is worth noting that full fine-tuning typically requires a lower learning rate.
    3. The results show significant improvements over the full fine-tuning approach across three datasets, but the improvements are marginal compared to conventional MIL methods. The time and memory consumption of conventional MIL methods are not reported. If prompt tuning is much slower and more memory-intensive than conventional MIL, it raises doubts about whether prompt tuning is necessary to achieve marginal improvements. Further investigation and comparison of the computational costs of both approaches may be needed to make an informed decision.
    4. While the proposed Prompt-MIL method is model-agnostic, the experiments only evaluate the approach on ABMIL, which may not be sufficient to fully validate its effectiveness. It would be beneficial to evaluate the approach on a wider range of MIL models, such as CLAM, TransMIL, and DTFD-MIL, to further demonstrate its effectiveness.
    5. The experimental results include only a single fold, which may not be robust enough to draw conclusive findings on such small datasets. To address this limitation, it would be beneficial to perform 5-fold cross-validation and to run multiple experiments with different random seeds, to ensure the effectiveness of the proposed approach is not due to random variation.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are reproducible. The experiments were conducted on three public datasets, and the authors will publish the source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Refer to the weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. My major concern is whether this prompt-tuning-based end-to-end training is necessary, given that it may increase training time and memory usage while only marginally improving performance.
    2. The results are insufficient and unconvincing, as the approach was only applied to a single MIL method and run once.
    3. The results do not align with the claims made in the introduction regarding full tuning.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper discusses the use of multiple instance learning (MIL) techniques for whole slide image (WSI) classification in computational pathology, where a WSI is divided into instances (patches) and a feature extractor generates features for each instance, which are then aggregated for WSI-level prediction. The authors propose a novel framework called Prompt-MIL that uses prompts for WSI-level classification tasks within an MIL paradigm. The proposed method fine-tunes an SSL-pretrained ViT feature extractor with a trainable prompt that calibrates the representations, making them task-specific. The authors conducted extensive experiments on three public WSI datasets, demonstrating the superiority of Prompt-MIL over conventional MIL methods in terms of accuracy, AUROC, GPU memory consumption, and training speed.
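    As a concrete illustration of the instance-aggregation step described above, the following is a generic attention-pooling sketch in the spirit of ABMIL (Ilse et al., 2018); it is a hypothetical stand-in, not the paper's code.

        import torch
        import torch.nn as nn

        class AttentionMILPooling(nn.Module):
            """Attention-weighted pooling of patch features into one
            slide-level prediction, as in attention-based MIL aggregators."""

            def __init__(self, feat_dim: int = 384, hidden_dim: int = 128, n_classes: int = 2):
                super().__init__()
                self.attn = nn.Sequential(
                    nn.Linear(feat_dim, hidden_dim),
                    nn.Tanh(),
                    nn.Linear(hidden_dim, 1),
                )
                self.classifier = nn.Linear(feat_dim, n_classes)

            def forward(self, instance_feats: torch.Tensor) -> torch.Tensor:
                # instance_feats: (n_instances, feat_dim) features of one slide's patches
                weights = torch.softmax(self.attn(instance_feats), dim=0)  # (n_instances, 1)
                slide_feat = (weights * instance_feats).sum(dim=0)         # (feat_dim,)
                return self.classifier(slide_feat)                         # slide-level logits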

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Simple but novel and effective algorithm that trains only a small fraction of parameters (prompts) to calibrate the pretrained representations to encode task-specific information.

    - Extensive experiments on three public WSI datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - I understand the challenge of learning features from WSIs, and the proposed algorithm is a simple but effective solution. However, in some scenarios it is also possible to classify image patches inside a WSI, for example in fatty liver disease. I would also like to see how the method compares against features trained with either self-supervised learning or task-specific supervised learning. In other words, instead of freezing the features for each patch, if we train the features end to end, how would the results compare to the proposed method?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is not available, but the datasets used in this study are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In some studies, patch-level annotations are available, so features can be learned directly. It would be of interest to compare your method (using frozen features but with trainable prompts) against task-specific trained features. See the study below:

    Heinemann, Fabian, et al. “Deep learning-based quantification of NAFLD/NASH progression in human liver biopsies.” Scientific Reports 12.1 (2022): 19236.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed framework is simple yet novel and interesting. Extensive experiments demonstrate the utility of the proposed framework for WSI classification.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose Prompt-MIL, an MIL framework that integrates prompts into WSI classification. Prompt-MIL adopts a prompt-tuning mechanism in which only a small fraction of parameters calibrates the pretrained features to encode task-specific information, rather than the conventional full fine-tuning approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Some strengths of the paper are mentioned below.
    1) Compared to conventional full fine-tuning approaches, this work fine-tunes far fewer parameters, yet achieves relative improvements in accuracy and AUROC, reduces GPU memory consumption, and trains faster.
    2) The authors utilize prompt-tuning techniques to address the subpar performance of SSL-pretrained vision transformers.
    3) The authors explore end-to-end training of the entire network using SSL-pretrained ViTs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) It is not clear from this work how and why previous MIL methods lagged and need improvement.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors convey clear, specific, and complete information about the data, code, models, and computational methods and analyses that support the contents and results presented in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper, entitled “Prompt-MIL: Boosting Multi-Instance Learning Schemes via Task-specific Prompt Tuning”, is very interesting and well written. I would like to provide the comments below for the authors to address.

    1) How is the classification loss calculated for the given experimental setup?
    2) How were the tissue patches created before being split into batches?
    3) The authors use three datasets in this work. It is advisable to describe the type of data and the number of samples in tabular form, which would be more visible in the manuscript.
    4) How is the model explainable with respect to the task-specific features captured by the prompt?
    5) How does the number of prompt tokens k affect accuracy and AUROC on the specific datasets?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript is well written, and the authors have thoroughly researched related work to support the content of the paper. The authors adapt the method and amend the techniques as needed for their experimental setup, which makes their contribution significant.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a framework called Prompt-MIL that uses prompts for WSI-level classification tasks within an MIL paradigm. Key strengths:

    1. A novel approach utilizing prompt-tuning techniques to address the subpar performance of MIL.
    2. Extensive experiments on three public WSI datasets.
    3. The paper is well-written and easy to follow.

    Key weaknesses:

    1. More details about the method implementation are needed.
    2. The full-tuning training strategy is not optimized.
    3. The results do not align with the claims made in the introduction regarding full tuning.




Author Feedback

We thank the reviewers and AC for their constructive criticism and positive evaluation. Here we address the major concerns; minor ones will be corrected in the camera-ready version.

[R3] Results not consistent with the second paragraph on page 2, i.e., that E2E training offers better performance than conventional MIL: There is no inconsistency. The results in the 2nd paragraph on page 2 refer to the setting of an ImageNet-pretrained ResNet, whereas the results in Table 1 refer to the setting of an SSL-pretrained ViT and are consistent with our claim in the 3rd paragraph on page 2, i.e., that full fine-tuning achieves subpar performance compared to conventional MIL.

[R3] Full-tuning training strategy is not optimal: We list the detailed settings as follows and will include them in the camera-ready version. For all full fine-tuning experiments, we used the learning rate (lr) of the corresponding prompt experiment as the base lr. For the parameters of the feature model F, which are SSL pretrained, we use 1/10 of the base lr. For the parameters of the classifier G, which are randomly initialized, we use the base lr. We train the full-tuning model for 10 more epochs than our prompt training to allow full convergence. This training strategy was tuned on the validation dataset, and we strive to be fair in all comparisons.
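For clarity, this per-module learning-rate scheme corresponds to standard optimizer parameter groups. Below is a minimal PyTorch sketch; the placeholder modules and the base_lr value are hypothetical stand-ins for the paper's feature model F, classifier G, and tuned base learning rate.

    import torch
    import torch.nn as nn

    feature_model = nn.Linear(384, 384)  # placeholder for the SSL-pretrained ViT F
    classifier = nn.Linear(384, 2)       # placeholder for the randomly initialized classifier G
    base_lr = 1e-4                       # assumed value, not taken from the paper

    optimizer = torch.optim.AdamW([
        {"params": feature_model.parameters(), "lr": base_lr / 10},  # pretrained: lower lr
        {"params": classifier.parameters(), "lr": base_lr},          # random init: base lr
    ])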

[R1] Details of our experiment settings: We clarify as follows. 1) We use binary cross-entropy loss when the task is tumor subtype classification and cross-entropy loss otherwise (Section 2, last paragraph, Equation 7). 2) We cropped the tissue region of each WSI into non-overlapping 224x224 patches before splitting them into batches (see the sketch below). 3) The type of data and the number of samples are described in detail in Section 3.1. 4) The effect of the number of prompt tokens k is evaluated on two datasets in the ablation study on page 8; we show that a single trainable prompt token is sufficient to boost the performance of conventional MIL methods.
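As a minimal sketch of the non-overlapping 224x224 tiling in point 2, the following assumes the tissue region has already been read into a NumPy array; slide I/O and background filtering, which a real WSI pipeline requires, are omitted, and tile_tissue_region is a hypothetical helper name.

    import numpy as np

    def tile_tissue_region(region: np.ndarray, size: int = 224) -> np.ndarray:
        """Split an (H, W, 3) tissue crop into non-overlapping size x size patches."""
        h, w = region.shape[:2]
        patches = [
            region[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)   # remainder at the edges is dropped
            for x in range(0, w - size + 1, size)
        ]
        if not patches:
            return np.empty((0, size, size, 3), dtype=region.dtype)
        return np.stack(patches)  # (n_patches, size, size, 3)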

[R1] Explainability of the task-specific features: As shown in Fig. 1 of the supplementary material, attention maps on both patches and WSIs show that our prompt token generates task-specific features and guides the model to focus more on task-specific regions such as tumor regions.

[R3] Marginal improvement while much slower and more memory-intensive: The performance improvement is not marginal; Prompt-MIL achieves up to a 4.03% improvement in accuracy and up to 8.97% in AUROC compared to conventional MIL. Our method requires more GPU memory and is slower than conventional MIL only during model training; during inference, Prompt-MIL and conventional MIL are similar in speed and memory.

[R3] Evaluated on ABMIL only: Our model is tested not only on ABMIL (in supp.), but also on DSMIL (in the main paper) and CLAM (in supp.), demonstrating its effectiveness and generalizability.

[R3] Influence of random variations: Table 1 and Supplementary Table 1 report the average results across 3 runs using different random seeds.

[R2] Classification of image patches using patch-level labels: An interesting future direction! However, here we investigate scenarios where patch-level labels are not available, which is the most common case in clinical practice, since obtaining patch-level labels is very expensive.

[R1] Why previous MIL lagged and needs improvement: As mentioned in the 2nd paragraph on page 2, previous MIL methods freeze their feature extractors due to GPU memory limitations. Their features are not fine-tuned toward the downstream tasks and are thus task-agnostic. Our prompt-tuning strategy can calibrate these features to be task-specific.


