
Authors

Kun Wu, Yushan Zheng, Jun Shi, Fengying Xie, Zhiguo Jiang

Abstract

Transformer-based multiple instance learning (MIL) frameworks have proven effective for whole slide image (WSI) analysis. However, existing spatial embedding strategies in Transformers can only represent fixed structural information, which makes them ill-suited to the scale-varying and isotropic characteristics of WSIs. Moreover, current MIL methods cannot take advantage of the large number of unlabeled WSIs available for training. In this paper, we propose a novel self-supervised whole slide image representation learning framework named position-aware masked autoencoder (PAMA), which makes full use of abundant unlabeled WSIs to improve the discrimination of slide features. Moreover, we propose a position-aware cross-attention (PACA) module with a kernel reorientation (KRO) strategy, which enables PAMA to maintain spatial integrity and semantic enrichment during training. We evaluated the proposed method on the public TCGA-Lung dataset with 3,064 WSIs and an in-house Endometrial dataset with 3,654 WSIs, and compared it with 4 state-of-the-art methods. The experimental results show that PAMA is superior to SOTA MIL and SSL methods.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_69

SharedIt: https://rdcu.be/dnwKs

Link to the code repository

https://github.com/WkEEn/PAMA

Link to the dataset(s)

https://portal.gdc.cancer.gov/projects/TCGA-LUAD


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper proposes a self-supervised representation learning framework that encodes the position and semantic information of patches in the WSI. The experiments indicate superior performance over existing pre-training methods, with ablations demonstrating the benefits of the positional encoding.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written (albeit with some typos) and extends the MAE pre-training task to encode the relative position and orientation of patches in a WSI. The position-aware cross-attention, which helps fuse position and semantic information, is interesting. The evaluation looks thorough and shows the superiority of the pre-training approach over existing methods, especially in low-data regimes.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is unclear how the image, distance, and orientation features are combined for the classification task. The authors mention a cls token, but it is unclear where it is added and how it interacts in the PACA attention module. How do the numbers of features and parameters compare between MAE and MAE+, considering the new position features being added?

    The motivation for the kernel reorientation is unclear to me; the authors should expand on this.

    The anchors and cross-attention make it easier to model interactions at large bag sizes, which are problematic for self-attention. The authors may want to contrast the number of parameters and the computational cost of this method against something like TransMIL.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    One of the datasets is public, and the authors promised the source code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors should expand on the motivation for the anchor-based distance and orientation features.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The ideas proposed in the paper are interesting and show significant improvements over the current pre-training methods.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. A position-aware masked autoencoder method and a self-supervised learning approach for histology whole-slide-level representation learning, which applies masking to local patch-level representations to reconstruct both position and features.
    2. A computation-saving position-aware cross-attention mechanism that preserves the correlation of local-to-global information during training.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Overall, the paper is well-organized with clear visualizations. The method is well-tailored for histology analysis and attempts to account for specific properties of histology images, such as isotropy.
    2. Evaluations on both the public TCGA-Lung dataset and in-house datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation should be more convincing: multiple instance learning (MIL) is a form of weakly supervised learning, so why should it consider a large number of unlabeled WSIs? Unlabeled WSIs can usually be handled by semi- or self-supervised learning, while the WSI-level labels for MIL are very easy to acquire.
    2. An important concept is not well-defined: a definition of the isotropic property of WSIs is missing.
    3. WSI-level tasks usually include survival prediction, biomarker prediction, etc., beyond tumor category classification; these are missing in this work.
    4. In the introduction, the authors claimed, ‘Moreover, we designed a position-aware cross-attention mechanism to guarantee the correlation of local-to-global information in the WSIs while saving computational resources.’ There are no experiments demonstrating the computational efficiency of the method (such as training time or a comparison of the number of parameters).
    5. Inference time and convergence comparisons are important complements to the performance comparisons, as time is usually important in histology analysis.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. The authors claimed they will release the code after acceptance, which might contribute to reproducibility.
    2. Yet, the hyper-parameters are not given.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. How are both randomness and similar category proportions achieved in the data splitting, as stated: ‘Each dataset was randomly divided into training, validation and test sets according to 6:1:3 while keeping each category of data proportionally’?
    2. It would be interesting to see the effect on overall performance of (1) different patch-level feature extraction methods, i.e., whether the proposed method is sensitive to the extracted features, and (2) different masking ratios.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See weakness in Q6.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper develops a self-supervised whole slide image representation learning framework. The proposed method aims to use unlabeled WSIs to improve the discrimination of slide features, based on a masked autoencoder and a position-aware cross-attention module.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Developing a decent self-supervised learning framework to generate the representation of WSIs is an interesting perspective and the solution of the proposed method by introducing the position-aware masked autoencoder and position-aware cross-attention module is technically sound.

    The idea of introducing anchors and relative position information in the pre-training (reconstruction) process is interesting.

    The paper is well-organized and clearly written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed method was only evaluated on one public dataset, but more are available, such as Camelyon16.

    The authors compared their method with MAE and HIPT, but more works, such as TransPath, are worth comparing against.

    For WSI classification performance, more SOTA methods such as CLAS, DSMIL, SETMIL, etc., are worth comparing against.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Satisfactory: The authors described their method details clearly, and the dataset is publicly available. However, the authors did not state whether they will make their code publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Detailed suggestions are given in the paper’s weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is technically sound and interesting; however, more experiments are needed and more SOTA methods should be compared to further illustrate the advances of this work.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a new Transformer architecture for unsupervised WSI representation learning. Reviewers appreciate the good writing and presentation of the paper as well as the good experimental results. This is a solid paper and should advance WSI analysis. Please try to address the reviewers’ concerns on comparison with more methods, and further clarify the motivation and technical details in the final version.




Author Feedback

We thank the reviewers and meta-reviewer for their efforts in reviewing our work. We realized it would be helpful to clarify some points raised in the reviewers’ comments, which are listed below.

  1. The anchor-based distance and orientation features. In natural scene images, semantics have a natural directional conspicuousness. For instance, in the case of a church, a door is most likely to be found below the windows rather than above them. Histopathology images, however, have no absolute definition of direction: the semantics of a WSI do not change with rotation, i.e., they are isotropic. Therefore, to avoid introducing bias into the structure encoding of a WSI, we define main orientations based on the kernel attention scores and combine them with relative distance encoding. The ablation study has proven the effectiveness of this design for maintaining the semantic integrity of pathological features. A similar idea appears in SIFT descriptors, where a main orientation must be determined before the feature is constructed.
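
The reorientation idea above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: per anchor, neighbor patches are binned by angle, the bins are weighted by attention scores, the dominant bin is taken as the main orientation (as in SIFT's dominant-orientation step), and all orientations are re-expressed relative to it so the encoding is rotation-invariant. The function `reorient` and its inputs are assumptions for illustration.

```python
import numpy as np

def reorient(angles, attn, n_bins=8):
    """angles: patch angles (radians) w.r.t. an anchor; attn: attention scores."""
    # Quantize each angle into one of n_bins orientation bins.
    bins = (angles % (2 * np.pi)) // (2 * np.pi / n_bins)
    # Attention-weighted orientation histogram, as in SIFT's dominant orientation.
    hist = np.bincount(bins.astype(int), weights=attn, minlength=n_bins)
    main = hist.argmax()              # attention-dominant main orientation
    return (bins - main) % n_bins     # orientations relative to the main one

angles = np.array([0.1, 1.7, 3.2, 4.8])
attn = np.array([0.1, 0.6, 0.2, 0.1])
rel = reorient(angles, attn)  # the second patch defines the main orientation
```

Rotating all input angles by the same amount shifts every bin and the main orientation together, so `rel` is unchanged, which is the rotation-invariance the isotropy argument calls for.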

  2. The motivation of self-supervised WSI representation learning. There is still a vast quantity of unlabeled data in practical scenarios. For instance, on network-based consultation and communication platforms, there are a large number of publicly available WSIs without any annotations or definite diagnosis descriptions. Self-supervised learning methods are essential to make full use of these data. They are also promising for developing cross-organ and pan-cancer histopathological WSI analysis systems.

  3. The technical details of the classification task. The distance and orientation are embedded into trainable features by φ_d(·) and φ_p(·), respectively. The embedding values are added as biases inside the softmax function, so they are involved in updating the image tokens and kernel tokens. The usage of the cls token follows the MAE framework: it is concatenated with the patch tokens. During pre-training, the cls token is not involved in the loss computation, but it continuously interacts with the kernels and receives global information. After pre-training, the pre-trained parameters of the cls token are loaded for fine-tuning and linear probing.
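
A minimal numpy sketch of the bias mechanism described above (not the authors' implementation): the distance and orientation embeddings enter the cross-attention as additive biases on the attention logits before the softmax. The stand-in arrays `dist_bias` and `orient_bias` represent the outputs of φ_d(·) and φ_p(·), which in the real model are trainable.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def paca(q, kv, dist_bias, orient_bias):
    """q: (k, d) kernel tokens; kv: (n, d) patch tokens;
    dist_bias / orient_bias: (k, n) biases standing in for phi_d / phi_p."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])          # (k, n) attention logits
    attn = softmax(scores + dist_bias + orient_bias)  # position-aware weights
    return attn @ kv                                  # updated kernel tokens

rng = np.random.default_rng(0)
k, n, d = 4, 16, 8
out = paca(rng.normal(size=(k, d)), rng.normal(size=(n, d)),
           rng.normal(size=(k, n)), rng.normal(size=(k, n)))
```

The symmetric update of patch tokens attending to kernels would follow the same pattern with the roles of `q` and `kv` swapped.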

  4. Further comparison with other SOTA methods. As suggested by Reviewer #1, we conducted additional weakly-supervised experiments with DSMIL and SETMIL. On the Endometrial dataset, DSMIL achieved AUC/ACC of 0.761/38.21 and 0.786/39.21, respectively, using 35%/100% of the labeled WSIs, while SETMIL achieved 0.795/38.71 and 0.831/40.84. On the TCGA-Lung dataset, with 35%/100% of the labeled WSIs, DSMIL achieved AUC/ACC of 0.911/75.00 and 0.938/80.11, respectively, while SETMIL achieved 0.937/80.21 and 0.962/84.95. These results fall significantly short of those achieved by our method. We will add them to Table 2.

  5. The computational resources. The computational complexity of self-attention is O(n²), where n is the number of patch tokens. In contrast, our proposed PACA has complexity O(k×n), where k is the number of kernel tokens. Since k ≪ n, the complexity is close to O(n), i.e., linear in the size of the WSI.
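
A back-of-the-envelope comparison of the attention score-matrix sizes makes the gap concrete. The token counts below are illustrative assumptions, not figures from the paper.

```python
n = 10_000  # patch tokens in a large WSI (illustrative)
k = 64      # kernel (anchor) tokens, k << n (illustrative)

self_attn_scores = n * n  # self-attention: one logit per token pair
paca_scores = k * n       # cross-attention: one logit per kernel-patch pair

ratio = self_attn_scores // paca_scores  # memory/compute saving factor n/k
```

Here the score matrix shrinks by a factor of n/k (about 156× with these numbers), and since k stays fixed as the slide grows, the cost scales linearly with n.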

  6. The hyper-parameters are given in the supplementary materials.

  7. We will correct the typos throughout the paper.


