Authors

Yu Zhao, Zhenyu Lin, Kai Sun, Yidan Zhang, Junzhou Huang, Liansheng Wang, Jianhua Yao

Abstract

Considering the huge size of the gigapixel whole slide image (WSI), multiple instance learning (MIL) is normally employed to address pathological image analysis tasks, where learning an informative and effective representation of each WSI plays a central role but remains challenging due to the weakly supervised nature of MIL. To this end, we present a novel Spatial Encoding Transformer-based MIL method, SETMIL, which has the following advantages. (1) It is a typical embedded-space MIL method and therefore has the advantage of generating the bag embedding by comprehensively encoding all instances with a fully trainable transformer-based aggregating module. (2) SETMIL leverages spatial-encoding-transformer layers to update the representation of an instance by aggregating both neighbouring instances and globally-correlated instances simultaneously. (3) The joint absolute-relative position encoding design in the aggregating module further improves the context-information-encoding ability of SETMIL. (4) SETMIL designs a transformer-based pyramid multi-scale fusion module to comprehensively encode the information with different granularity using multi-scale receptive fields and make the obtained representation enriched with multi-scale context information. Extensive experiments demonstrated the superior performance of SETMIL in challenging pathological image analysis tasks such as gene mutation and lymph node metastasis prediction.
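To make the term "embedded-space MIL" in the abstract concrete, the following is a minimal sketch of turning a bag of instance features into a single bag embedding. It uses plain attention-weighted pooling as a stand-in, not SETMIL's transformer-based aggregator; the function names and the scoring vector `w` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_embedding(instances, w):
    """Aggregate instance feature vectors into one bag embedding.

    instances: list of equal-length feature vectors (one per patch).
    w: hypothetical scoring vector; the score of instance x is <w, x>.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    # Weighted sum of all instances: every patch contributes to the bag vector.
    return [sum(a * x[d] for a, x in zip(weights, instances)) for d in range(dim)]


# Three toy patch features; with w = 0 all weights are equal (mean pooling).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = bag_embedding(patches, w=[0.0, 0.0])
```

A slide-level classifier would then operate on `embedding` alone, which is what distinguishes embedded-space MIL from instance-level approaches that score each patch separately.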

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_7

SharedIt: https://rdcu.be/cVRq4

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper presents a novel spatial-encoding multiple instance learning method for pathological image analysis. It introduces an attention-based pyramid multi-scale fusion module, a novel way of aggregating the local information of the patches. In addition, the joint relative and absolute position encoding module simulates the diagnostic process of pathologists. With these two modules, the transformer-based model achieves improved performance in pathological image analysis.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1. The pyramid multi-scale fusion is interesting: combining T2T (tokens-to-token) and MSA (multi-head self-attention) enhances the local representation of the feature map, and this process is trainable.
2. In the SET module, the relative encoding makes the MSA “pay more attention” to local information from neighbouring instances.
3. The network has advantages in tasks where local information from neighbouring patches is vital for the pathological features.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. TransMIL [17] has already adapted the transformer to the MIL problem in pathological image analysis. Essentially, this paper seems similar to TransMIL [17]. The differences from TransMIL [17] should be clearly described.
2. The SET module is in fact the relative encoding module, but this is not clearly stated in the paper.
3. The paper shows that local information from neighbouring instances is vital for pathological image analysis, yet this view is not stated in the paper.
4. The pyramid multi-scale fusion module and the SET module both enhance the local representation of neighbouring instances; the purpose of this design and the difference between these modules should be described clearly.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

the paper provided sufficient details, the reproducibility is good.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

see 5.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The design of the network is effective. The novelty is not described clearly.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

I read the comments given by the other reviewers as well as the authors’ response. I maintain my judgement.

Review #3

Please describe the contribution of the paper

This paper proposes SET-MIL, a transformer-based framework for WSI representation learning which incorporates spatial information (neighbouring instances and globally correlated instances) to obtain representations that capture more semantic information.

Based on the experiments, the proposed method outperforms recent MIL-based methods for WSI representation learning task.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is technically sound.
- The idea of aggregating representations of both neighbouring instances and globally correlated instances simultaneously, which mimics clinical practice, seems to be novel and improves results.
- Experiments are comprehensive.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- I am concerned about the practical application of the current framework, as its memory usage approaches that of feeding WSIs directly to the model. This is because the current approach employs all patches from a WSI for representation learning, which leads to the well-known memory bottleneck of this task.
- The fact that all patches are used makes the experiments against many MIL methods unfair, as those methods employ only a small number of patches to train the model.
- Comparing against a framework that uses all patches, such as “Neural Image Compression for Gigapixel Histopathology Image Analysis” (David Tellez, Geert Litjens, Jeroen van der Laak, Francesco Ciompi), seems to be a fair comparison.
- The training here is not end-to-end, which may lead to sub-optimal solutions. Recently there have been efforts to develop end-to-end MIL methods for WSI representation learning. It is necessary to discuss your approach against methods that use a small proportion of patches but are trained in an end-to-end manner. Which approach is preferred, and why? See “CNN and Deep Sets for End-to-End Whole Slide Image Representation Learning” (Sobhan Hemati, Shivam Kalra, Cameron Meaney, Morteza Babaie, Ali Ghodsi, Hamid Tizhoosh).
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The source code has been provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- I believe the authors should discuss the memory usage of the current approach and the fairness of the experiments against methods that use a small proportion of patches.
To this end, they can compare their method against work that also uses all the patches. Here is an example:

“Neural Image Compression for Gigapixel Histopathology Image Analysis” David Tellez, Geert Litjens, Jeroen van der Laak, Francesco Ciompi

It is also necessary to benchmark their work against end-to-end methods. Here is an example:

“CNN and Deep Sets for End-to-End Whole Slide Image Representation Learning” Sobhan Hemati, Shivam Kalra, Cameron Meaney, Morteza Babaie, Ali Ghodsi, Hamid Tizhoosh
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although the practical application of the current framework is questionable, the ideas in SETMIL for mimicking the clinical practice by aggregating representations of both neighbour instances and globally correlated instances are interesting.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Review #2

Please describe the contribution of the paper

The presented paper introduces a MIL approach for WSI level feature embedding and classifier with position preserving embedding followed by Transformer-based Pyramid Multi-Scale Fusion. Results were verified in two datasets in the paper and TCGA experiments in the supplementary materials.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

In general, the work seems novel, and the results are promising. I would support this work as it addresses the main concerns of computational pathology.
1- Providing slide-level descriptions for WSIs is what people need with current datasets like TCGA. MIL approaches are an excellent way to use this slide-level information.
2- The idea of preserving the coordinates of all patches is new to the best of my knowledge. This is important because tissue morphology is meaningful when looking at its neighbours.
3- A multi-level view is necessary when dealing with WSIs, and this work uses the multi-scale approach smartly.
4- Experiments are well designed and multiple datasets have been explored; however, I have some concerns regarding the experimental comparison.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1- The main concern is the fairness of the experiments. Most MIL methods use a small portion of each WSI, while the developed method uses all patches in a WSI, so it might not be fair to compare a lightweight algorithm with a brute-force method.
2- The other weakness is connected to the first one. Because the whole WSI is fed to the model, the model is still large (even with L’’ and W’’), such that just 4 of them fit on a V100. As a result, the work might not be reproducible with regular GPUs.
3- The last weakness is feeding the blank space to the model. Even if the model removes it after training, it will still be fed to the model.
4- I wish a more pathology-related point of view, such as multi-magnification (5x, 10x, 20x), had been used rather than multi-level.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Sharing code is one of the best practices for reproducibility, and I would say the work is reproducible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Fig. 2, sub-figure (A): the meaning of the colors is not consistent. For a while I was confused by the yellow, green, and blue color code for the 3x3, 5x5, and 7x7 windows; right below, the same colors are used to show the unfold operation without any relation to the window size.

$= \mathrm{Concat}(\mathrm{T2T}_{\kappa=3}(E_i), \mathrm{T2T}_{\kappa=5}(E_i), \mathrm{T2T}_{\kappa=7}(E_i))$: a period (.) is needed after this equation.

“achieve inferior performance compared to other methods [1, 9, 8].” should read “other MIL methods”.

Table 1 needs the patch percentage used from each WSI.
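The multi-scale concatenation quoted in the comments above can be illustrated with a small sketch. It shows only the unfold ("soft split") and concatenation steps for kernel sizes 3, 5, and 7 on a 2D grid of patch features. The stride of 1, the zero padding, and the function names are assumptions made here for illustration; the attention and projection layers of a full T2T block are omitted.

```python
def unfold(grid, k):
    """For every cell, concatenate its k x k neighbourhood into one token.

    grid: height x width x channels nested list of floats.
    Cells outside the grid are treated as zero vectors (assumed padding).
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    r = k // 2
    zero = [0.0] * c
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            token = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    inside = 0 <= y < h and 0 <= x < w
                    token.extend(grid[y][x] if inside else zero)
            row.append(token)
        out.append(row)
    return out


def multi_scale_tokens(grid, kernels=(3, 5, 7)):
    """Concatenate the unfolded tokens of all kernel sizes per cell."""
    per_scale = [unfold(grid, k) for k in kernels]
    h, w = len(grid), len(grid[0])
    return [
        [sum((scale[i][j] for scale in per_scale), []) for j in range(w)]
        for i in range(h)
    ]
```

With C channels per cell, each output token has C * (9 + 25 + 49) entries, so a learned projection would normally follow to bring the dimension back down.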
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Sufficient innovation, extensive experimentation, and acceptable results are the main reasons for acceptance.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Review #4

Please describe the contribution of the paper

This paper proposes a transformer-based MIL method for WSI analysis, which consists of position-preserving encoding, feature fusion, and bag embedding with spatial encoding. Specifically, both absolute and relative position information are considered via the sinusoid procedure and learnable weights, respectively. The method achieves promising results on two datasets, and the effect of each component is evaluated by ablation studies.
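As a rough illustration of the two kinds of position information described above, the sketch below builds a 2D absolute code from the standard Transformer sinusoid formula and looks up a relative bias from a table of scalars indexed by grid offset. The exact formulation in the paper is not given in this review, so the row/column concatenation, the offset-indexed table, and all names are assumptions.

```python
import math


def sinusoid(pos, dim):
    """Standard Transformer sinusoidal code for one integer position."""
    code = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        code.extend([math.sin(angle), math.cos(angle)])
    return code[:dim]


def absolute_encoding_2d(row, col, dim):
    """Concatenate a row code and a column code, each of size dim // 2."""
    half = dim // 2
    return sinusoid(row, half) + sinusoid(col, half)


def relative_bias(table, q, k):
    """Look up a scalar by the 2D offset between query and key cells.

    table: dict mapping (d_row, d_col) -> scalar. In a trained model these
    scalars would be learnable parameters; offsets not in the table give 0.
    """
    offset = (k[0] - q[0], k[1] - q[1])
    return table.get(offset, 0.0)
```

In a typical design the absolute code is added to the token features, while the relative bias is added to the attention logit of each query-key pair before the softmax.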
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The organization of the paper is good; the method is clearly presented and well motivated.
2. The experiments are largely complete. The proposed method achieves performance gains on different benchmarks. The ablation studies demonstrate the effectiveness of each component.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. Limited novelty: Position-preserving encoding shares the same idea with NIC [1], the Feature fusion procedure is similar to TransMIL [2], and the spatial encoding transformer is also widely used as in the reference [24, 27] in the paper (except the relative position encoding). It would help to highlight the novelty of the method.
2. More explanation of relative position encoding: It is interesting to discuss the way to encode 2D positions in WSIs. However, whether the learnable scalar can learn meaningful results is still unknown and not discussed, though ablation studies show improved performance.
3. Fair comparison: a) It would be better to report the # of params of each method. As so many components are used, one may argue that the improvements come from the larger models. b) It would be better to include the baseline model used in the method instead of only proving the necessity of each component. Ref: [1] Tellez, David, et al. “Neural image compression for gigapixel histopathology image analysis.” TPAMI, 2019. [2] Shao, Zhuchen, et al. “Transmil: Transformer based correlated multiple instance learning for whole slide image classification.” NeurIPS, 2021.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Good. Code is available.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. Limited novelty: Position-preserving encoding shares the same idea with NIC [1], the Feature fusion procedure is similar to TransMIL [2], and the spatial encoding transformer is also widely used as in the reference [24, 27] in the paper (except the relative position encoding). It would help to highlight the novelty of the method.
2. More explanation of relative position encoding: It is interesting to discuss the way to encode 2D positions in WSIs. However, whether the learnable scalar can learn meaningful results is still unknown and not discussed. It would help to show the learned scalar, as in the upper-right corner of Fig. 3. If it works like a Gaussian kernel, it would be better to discuss the relationship with patch-based CNN [3], which explicitly uses a Gaussian kernel for modeling spatial relationships.
3. Fair comparison: a) It would be better to report the # of params of each method. As so many components are used, one may argue that the improvements come from the larger models. b) It would be better to include the baseline model used in the method instead of only proving the necessity of each component. Ref: [1] Tellez, David, et al. “Neural image compression for gigapixel histopathology image analysis.” TPAMI, 2019. [2] Shao, Zhuchen, et al. “Transmil: Transformer based correlated multiple instance learning for whole slide image classification.” NeurIPS, 2021. [3] Hou, Le, et al. “Patch-based convolutional neural network for whole slide tissue image classification.” CVPR. 2016.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Novelty and fair comparison.
Number of papers in your stack

7
What is the ranking of this paper in your review stack?

4
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

Limited novelty.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper presents a weakly-supervised analysis method for WSIs. The method is built on a transformer, with position-preserving encoding, feature fusion, and bag embedding with spatial encoding. Evaluation is conducted on two private datasets and shows improved performance over other MIL methods. The reviewers are generally positive about the paper; however, there are many comments related to lack of novelty, large memory consumption, fairness of the performance comparison, non-end-to-end learning, etc. In addition, since both datasets are private and MIL for WSI classification has been extensively studied in the literature, it would be more convincing to evaluate the method on publicly available datasets, especially CAMELYON16, which has a clear train/test split and has been used in DSMIL and TransMIL. While TCGA datasets are used as described in the supplementary material, those results would be affected by the number of images and the train/test split.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

6

Author Feedback

We thank the reviewers for their generally positive assessments and respond to the main concerns.

1. Novelty [R1 R4] We first clarify the challenges of our tasks. Unlike tumor classification in TCGA or lymph node metastasis (LNM) detection in CAMELYON16 (WSIs from sentinel lymph nodes), where the WSI-level label has patch-level representations, our gene mutation and LNM prediction tasks (WSIs from the primary tumor region) have no local patch-level representation; global factors such as the tumor microenvironment and the proportion of tumorous cells are more closely related to the WSI-level label. Furthermore, our two tasks are new and have no established clinical guidelines. Pathologists expect AI to provide explanations (the contribution of each region) that could serve as potential guidelines. We developed a generalizable and interpretable MIL solution for these novel pathological image analysis problems, where the entire WSI should be considered. Specifically, we developed a transformer-based model that unifies the following functions: embedding the whole WSI, multi-scale learning, and adaptively adjusting features learnt from neighbors and/or from globally correlated regions. Our design can be regarded as a combination of a GNN (with feature distances as the edges) [Zhao, et al. FS-GCN-MIL CVPR 2020] and a CNN (with a circular kernel), which is the first design of this kind for pathological image analysis.

2. Comparison Fairness (all vs selected patches) [R2 R3 R4] As discussed in response 1, embedding the entire WSI is necessary to solve our clinical tasks. Therefore, we compared our model with other models capable of embedding the entire WSI, including CLAM, TransMIL, and ViT-MIL (our baseline). All methods were tested on the same extracted raw instance-level features to evaluate their bag aggregation performance. We also compared our method with existing methods on the public TCGA dataset, on tasks where the WSI-level label has a patch-level representation and which are therefore suitable for selected-patch models. Our model outperformed TransMIL/DSMIL/CLAM, which have been reported to perform better than selected-patch-based methods such as max/average-pooling-based methods [Shao et al. NeurIPS 2021] (supplementary Table 3). NIC [Tellez, et al. TPAMI 2019] utilizes a CNN for bag aggregation and was compared in our work (Table 1, denoted as CNN-MIL). Besides, random patch-selection methods such as CNN-DS [Hemati, et al. MIDL 2021] risk missing informative patches during bag aggregation; CNN-DS obtained lower accuracy (86%) in TCGA LUAD/LUSC classification than our method (89.3%).

3. Memory consumption [R2 R3 R4] Our PPE module dramatically reduces the memory requirement when embedding the entire WSI. As tested, our model ran on a cloud server with 1/2 NVIDIA TESLA T4 (8GB Memory); therefore, it should work on most regular GPUs.

4. Whether end-to-end [R3] DNNs can learn generalizable patterns in their early layers. We use a pre-trained model to extract elementary high-level concept features, and our TPMF module then fine-tunes these features for the subsequent MIL tasks. The model can be regarded as end-to-end with the first layers fixed. Besides, our design allows more data and different strategies, such as contrastive learning, to be used to train the patch encoder, and it therefore has the potential to obtain more generalizable features, whereas end-to-end training normally uses only the current task’s dataset to train the patch encoder. Furthermore, end-to-end models have been compared in previous work such as DSMIL, TransMIL, and FS-GCN-MIL, where they showed inferior performance.

5. Feeding blank space [R2] WSIs have different numbers of patches. Therefore, most MIL methods, e.g. ABMIL, DSMIL, and TransMIL, use a batch size of 1. Feeding blank space gives every input to our model the same feature size, which allows bigger batch sizes.
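The batching argument in response 5 can be sketched as follows: patch features are scattered onto a fixed-size grid by their coordinates, and cells without tissue stay zero, so slides with different patch counts produce inputs of identical shape. The grid size, the coordinate convention, and the returned mask are assumptions added for illustration; the rebuttal does not say whether a mask is used.

```python
def to_fixed_grid(patches, height, width, dim):
    """Scatter patch features into a height x width grid; empty cells stay zero.

    patches: list of ((row, col), feature_vector) pairs.
    Returns the grid and a mask marking which cells hold real tissue.
    """
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    mask = [[False] * width for _ in range(height)]
    for (r, c), feat in patches:
        if 0 <= r < height and 0 <= c < width:
            grid[r][c] = list(feat)
            mask[r][c] = True
    return grid, mask


# Two toy slides with different numbers of patches share one input shape.
slide_a = [((0, 0), [1.0, 2.0]), ((1, 2), [3.0, 4.0])]
slide_b = [((2, 1), [5.0, 6.0])]
batch = [to_fixed_grid(s, height=3, width=3, dim=2)[0] for s in (slide_a, slide_b)]
```

Because every element of `batch` is a 3 x 3 x 2 grid, the slides can be stacked into one tensor, at the cost of also processing the blank cells that Reviewer #2 points out.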

6. More datasets [Meta-R] SETMIL has been tested on 4 datasets. Due to limited space, we will include more datasets in future work.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper presents a weakly-supervised analysis method for WSIs. The study presents an interesting problem although the datasets are private. The rebuttal has addressed most of the comments, although might not be so convincing in terms of novelty. The final version should reflect the clarifications described in the rebuttal.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

10

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper addresses the problem of training classifiers of WSIs using weak slide-level labels; this is a very active research area and one of great importance in digital pathology. Some of the reviewers expressed concerns about the ability of this method to run on conventional GPUs, but this appears to have been addressed by the authors in the rebuttal. Although reviewers requested that 2 further WSI approaches be used for comparison, the methods used as baselines in the paper are generally considered to be SOTA, and the performance gains reported over those methods are impressive. As meta-reviewer #1 points out, Camelyon would have been an excellent test set to use as one of the baselines.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

2

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The presented paper introduces a weakly supervised approach for WSI level feature embedding and classifier with position preserving embedding followed by Transformer-based Pyramid Multi-Scale Fusion. Results of the presented approach were validated on two datasets in the main paper and TCGA experiments in the supplementary materials. With regard to the critiques, the authors clearly rebutted the concerns. While the concern regarding novelty still exists, the proposed method demonstrates improvement over state-of-the-art and has merit in terms of clinical utility.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

3

SETMIL: Spatial Encoding Transformer-based Multiple Instance Learning for Pathological Image Analysis