
Authors

Iain Carmichael, Andrew H. Song, Richard J. Chen, Drew F.K. Williamson, Tiffany Y. Chen, Faisal Mahmood

Abstract

Supervised learning tasks such as survival prediction that involve large whole slide images (WSIs) are a critical challenge in computational pathology that requires modeling complex features of the tumor microenvironment. These learning tasks are often solved with deep multi-instance learning (MIL) models that do not explicitly capture intratumoral heterogeneity. We develop a novel variance pooling architecture that enables a MIL model to incorporate intratumoral heterogeneity into its predictions. Two interpretability tools based on “representative patches” are illustrated to probe the biological signals captured by these models. An empirical study with 4,479 gigapixel WSIs from the Cancer Genome Atlas shows that adding variance pooling into existing MIL frameworks improves survival prediction performance for five cancer types.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_38

SharedIt: https://rdcu.be/cVRrV

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose an attention variance pooling method to aggregate patch-level features into a WSI-level representation, which explicitly considers intratumoral heterogeneity. Experimental results show that adding this pooling method into existing multi-instance learning frameworks improves survival prediction in five WSI datasets of different cancer types.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A novel patch-level feature pooling method that takes tumor heterogeneity into account.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    None

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Very good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It is not clear why the authors chose the specific five cancer types from TCGA. They said the choice of these types was based on the number of patients. However, there are some other cancer types in TCGA with more patients than some of the selected cancer types.
    2. The sentence below is difficult to understand; please consider rephrasing it: “Since the number of patches is different across WSIs, we cap – by random subsampling – the number of instances in each bag at the 75% quantile for that cancer type.” The 75% quantile of what?
    3. According to the results in Table 1, the increase in c-index is only 5.3% for the Deep Sets method on the COADREAD dataset, not the 10.4% mentioned in the text. Please double check.
    4. In the conclusion section, there is no need to spell out the full form of MIL again.
    5. The number of patches (i.e., the 75% quantile) used for training and the number of projection vectors are two key parameters; a sensitivity analysis of them should be performed.
    6. It is not clear how to incorporate VarPool into DeepGraphConv.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript is well written and organized and contains novel elements.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose an attention variance pooling module that helps multiple instance learning (MIL) frameworks learn intratumoral heterogeneity (ITH) information from encoded patch features of whole slide images (WSIs). This module can be incorporated into existing MIL frameworks for survival prediction tasks. The authors also provide two metrics for interpretability. Experiments on five datasets from The Cancer Genome Atlas (TCGA) demonstrate the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and easy to follow.
    2. The idea of capturing the intratumoral heterogeneity information is novel and interesting in pathology image analysis. The motivation and the implementation method are well explained.
    3. The proposed module can be easily incorporated into existing MIL frameworks and improve the performance.
    4. Visualizations of patches (selected via attention scores or ordered by variance projection scores) demonstrate good interpretability of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The MIL architectures used in the experiments are not well introduced; the “Deep Sets” architecture is never mentioned before the experiments section.
    2. Implementation details are lacking. The process of incorporating the variance pooling component into “Deep Sets” (which has no attention mechanism) should differ from that for “Attn MeanPool” (which has an attention mean pooling module), but no details are provided.
    3. The selection of some experimental settings is not explained, for example the number of variance pooling projections K=10 and the log() nonlinearity (see the sketch below).
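
A minimal sketch of what such a module could look like, assuming PyTorch, a VarPool module name, bag-level tensors of shape (n_patches, dim), and an eps stabilizer; these are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class VarPool(nn.Module):
    """Sketch of attention-weighted variance pooling over patch features.

    Projects each patch embedding onto K learned directions and returns the
    attention-weighted variance along each direction, passed through log().
    """

    def __init__(self, dim, n_proj=10, eps=1e-6):
        super().__init__()
        self.proj = nn.Linear(dim, n_proj, bias=False)  # K learned projection vectors
        self.eps = eps  # stabilizer inside the log nonlinearity

    def forward(self, x, attn=None):
        # x: (n_patches, dim) patch embeddings for one WSI bag
        # attn: (n_patches,) attention weights summing to 1; None -> uniform
        if attn is None:
            attn = x.new_full((x.shape[0],), 1.0 / x.shape[0])
        s = self.proj(x)                                    # (n_patches, K) projection scores
        mean = (attn[:, None] * s).sum(dim=0)               # weighted mean per direction
        var = (attn[:, None] * (s - mean) ** 2).sum(dim=0)  # weighted variance
        return torch.log(var + self.eps)                    # (K,) pooled ITH features
```

With attn=None the module reduces to a uniform-weight variance, which matches the Deep Sets case described in the author feedback below.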
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. All five datasets are publicly accessible.
    2. The authors stated that code will be publicly available.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Related work. I suggest the authors introduce the three selected MIL architectures more thoroughly in the related work.
    2. Implementation details. I suggest the authors provide more detail on incorporating the proposed module into the different architectures.
    3. Experimental settings. I suggest the authors conduct more ablation studies and explain their choice of hyperparameters and network configurations.
    4. Cross-validation. Instead of randomly splitting the data 10 times, a more straightforward way to show stability and generalizability might be 10-fold cross-validation.
    5. Baseline experiments. For MIL architectures without attention mechanisms, it is not clarified whether a mean pooling module is used. If yes, the performance gain from “MeanPool” to “MeanPool + VarPool” is the more important evidence for the effectiveness of the proposed variance pooling module. If not, I suggest the authors add another baseline with only “MeanPool” (see the sketch of a combined head below).
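
To make the requested comparison concrete, a combined WSI-level head might concatenate the mean-pooled embedding with the variance-pooled features before the survival risk layer. The sketch below reuses the hypothetical VarPool module from the earlier sketch; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MeanVarHead(nn.Module):
    """Sketch: concatenate an (attention-)mean-pooled embedding with VarPool output."""

    def __init__(self, dim, n_proj=10):
        super().__init__()
        self.var_pool = VarPool(dim, n_proj)    # hypothetical module sketched earlier
        self.risk = nn.Linear(dim + n_proj, 1)  # scalar survival risk score

    def forward(self, x, attn=None):
        # x: (n_patches, dim); attn: (n_patches,) weights, or None for uniform
        if attn is None:
            attn = x.new_full((x.shape[0],), 1.0 / x.shape[0])
        mean_feat = (attn[:, None] * x).sum(dim=0)          # (dim,) "MeanPool"
        var_feat = self.var_pool(x, attn)                   # (n_proj,) "VarPool"
        return self.risk(torch.cat([mean_feat, var_feat]))  # (1,) risk score
```

The ablation the reviewer asks for would compare a head trained on mean_feat alone (“MeanPool”) against this concatenated head (“MeanPool + VarPool”).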
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The idea of capturing the intratumoral heterogeneity information is novel and interesting in pathology image analysis.
    2. Experiments with open-sourced code (authors stated they will make code publicly available) and public datasets are reproducible.
    3. The experimental results demonstrate that incorporating the proposed module to MIL frameworks improves survival prediction performance.
  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Intratumoral heterogeneity in WSIs has been extensively studied, and ongoing efforts focus on characterizing the structural and appearance differences across diagnostic subtypes. This paper addresses the key challenge of modeling complex features of the tumor microenvironment. The authors have developed a novel variance pooling architecture that enables an MIL model to incorporate the inherent intratumoral heterogeneity into its predictions. The authors also provide two interpretability tools to investigate the biological signals captured by the models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors have developed a variance pooling framework that captures intratumoral heterogeneity by quantifying ITH as the variance along a collection of low-rank projections of patch features.
    2. The authors have designed a variance projection contrast visualization to provide model interpretability, which is clinically relevant. Further, the ability to surface the biological signals captured by the deep learning architecture promotes trustworthiness of the system, subject to expert approval (in this case, a pathologist’s); a sketch of this ranking idea follows the list.
    3. The mathematical formulation of the proposed strategy is well described.
    4. The implementation details are well described.
    5. The choice of parameters, loss functions, parameter initializations are justified.
    6. An exhaustive set of results from different experiments with standard measures is provided.
    7. Visual illustration of the proposed interpretability tool highlights diagnostically useful regions which can be useful for assisting pathologists in decision making.
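
As a hedged illustration of the variance projection contrast mentioned in point 2, one way to realize it is to rank patches by their mean-centered scores along a single learned projection and display the two extremes; the function below is a sketch under that assumption, not the authors' released tool.

```python
import numpy as np

def variance_projection_contrast(patch_feats, proj_vec, top_k=8):
    """Sketch: find the patches at the two extremes of one variance projection.

    patch_feats: (n_patches, dim) patch embeddings for one WSI
    proj_vec:    (dim,) one learned projection direction from variance pooling
    Returns indices of the top_k most negative and top_k most positive patches,
    i.e., those contributing most to the variance along this direction.
    """
    scores = patch_feats @ proj_vec
    scores = scores - scores.mean()       # center so the two signs are contrastive
    order = np.argsort(scores)
    return order[:top_k], order[-top_k:]  # (negative extreme, positive extreme)
```

Displaying the two extreme patch sets side by side shows which tissue appearances the model treats as the poles of heterogeneity along that direction.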
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors describe two interpretability approaches. However, it is unclear how these approaches would fit into routine diagnostic workflows. Could the authors describe some of the structural alterations as the spectrum progresses from negative to positive SAsqr values?
    2. Did any discussions with the pathologists happen prior to planning this study for providing model interpretations? Pathologist guided studies will help in developing models that could help in clinical adoption.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided sufficient details for reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Could the authors expand on the ideas of interpretability as mentioned above?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has some novelty, however it needs to address some questions described above.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes attention-based variance pooling over the patches to improve the performance of MIL frameworks. The proposed method is novel and effective. The reviewers suggest enriching the paper with more details on the experimental design and results discussion, e.g., hyperparameter selection and comparisons.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank the reviewers for the valuable comments, which helped us greatly improve the quality of the manuscript. We will incorporate the suggested changes into the camera-ready version of the manuscript. Below, we respond to the reviewers’ suggestions.

Comment: The choice of 5 TCGA types

We chose these 5 TCGA types because they had a large number of samples and we attempted to closely follow the experimental setup of previous work in this area (Chen et al, 2021).

Comment: Clarification on implementation (number of patches, architecture implementation) and experimental details (random resample vs. cross-validation)

We will clarify the language about patch subsampling and add further details on the architecture implementation. For the Deep Sets architecture, we use uniform attention weights across the patches. For the graph baseline, we compute the variance of the features produced by the series of graph-convolutional layers, just before they are fed into the MLP. Additionally, the released code will contain the implementation details.
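
A minimal sketch of the graph-baseline integration described above, assuming hypothetical gnn and mlp modules and reusing the hypothetical VarPool sketch from the reviews; the actual DeepGraphConv wiring may differ.

```python
import torch

def graph_varpool_forward(gnn, mlp, var_pool, x, edge_index):
    # Sketch: insert variance pooling between the graph-conv stack and the MLP.
    h = gnn(x, edge_index)                # (n_nodes, dim) post-graph-conv features
    mean_feat = h.mean(dim=0)             # standard mean readout over nodes
    var_feat = var_pool(h)                # (K,) variance along learned projections
    return mlp(torch.cat([mean_feat, var_feat]))  # WSI-level risk score
```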

Due to the small sample size and the significant number of unobserved survival events, we concluded that a 70/30 train/test split was the best balance. We also wanted more than 5 dataset resamples to improve stability, so we chose a resampling scheme that decouples the number of resamples from the train/test proportions. This scheme is also used in similar papers, e.g. (Lu et al, 2021).
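
For concreteness, the subsampling and resampling protocol described above might look like the following sketch; the variable names and the use of scikit-learn's ShuffleSplit are our assumptions, not the released code.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)

def cap_bags(bags):
    # bags: list of (n_patches_i, dim) arrays for one cancer type.
    # Cap each bag at the 75% quantile of patch counts for that cancer type.
    cap = int(np.quantile([b.shape[0] for b in bags], 0.75))
    capped = []
    for b in bags:
        if b.shape[0] > cap:
            idx = rng.choice(b.shape[0], size=cap, replace=False)
            b = b[idx]  # random subsampling down to the cap
        capped.append(b)
    return capped

# Repeated random 70/30 train/test resampling (10 resamples), which decouples
# the number of resamples from the split proportions, unlike 10-fold CV.
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
```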

Comment: Sensitivity analysis (activation function, number of var pools, loss function)

We conducted informal sensitivity analyses of both the activation function (logarithm, square root, and sigmoid) and the number of variance pools; the results did not change significantly. Due to the manuscript's space constraints, we did not include a formal sensitivity analysis of these parameters. We do provide a sensitivity analysis of the loss function (ranking and Cox survival losses) in the appendix.
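
Since the feedback mentions ranking and Cox survival losses, here is a minimal sketch of the standard Cox partial likelihood loss (Breslow approximation for ties); this is the textbook formulation, not necessarily the exact loss used in the paper.

```python
import torch

def cox_partial_likelihood_loss(risk, time, event):
    """Negative Cox partial log-likelihood, Breslow approximation for ties.

    risk:  (n,) predicted log-risk scores from the model
    time:  (n,) observed survival or censoring times
    event: (n,) float, 1.0 if the event was observed, 0.0 if censored
    """
    order = torch.argsort(time, descending=True)  # latest times first
    risk, event = risk[order], event[order]
    # logcumsumexp over the sorted scores gives, for each i, the log of the
    # risk-set sum over all j with t_j >= t_i.
    log_risk_set = torch.logcumsumexp(risk, dim=0)
    uncensored_ll = (risk - log_risk_set) * event
    return -uncensored_ll.sum() / event.sum().clamp(min=1.0)
```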

Comment: Interpretability usage

At this point in time, most interpretability tools are used primarily in research settings. Bringing these tools to clinical practice is an active area of research, which we view as future work. The SAsqr interpretability tool is in line with standard MIL tools (e.g. attention heatmaps).

Comment: Consultation with pathologists

The idea for this manuscript came out of extensive discussion with pathologists about where existing MIL architectures fall short. These discussions helped us understand the critical importance of intratumoral heterogeneity in determining cancer response to treatment and ultimately survival. In these discussions we realized that the existing MIL architectures do not capture intratumoral heterogeneity, a problem we sought to address in this manuscript. Two of the co-authors are pathologists.

Bib

Chen, R.J., Lu, M.Y., Shaban, M., Chen, C., Chen, T.Y., Williamson, D.F. and Mahmood, F., 2021, September. Whole Slide Images are 2D Point Clouds: Context-Aware Survival Prediction using Patch-based Graph Convolutional Networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 339-349). Springer, Cham.

Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M. and Mahmood, F., 2021. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6), pp.555-570.


