
Authors

Gianpaolo Bontempo, Angelo Porrello, Federico Bolelli, Simone Calderara, Elisa Ficarra

Abstract

The adoption of Multi-Instance Learning (MIL) for classifying Whole-Slide Images (WSIs) has increased in recent years. Indeed, pixel-level annotation of gigapixel WSIs is largely unfeasible and time-consuming in practice. For this reason, MIL approaches have been profitably integrated with the most recent deep-learning solutions for WSI classification to support clinical practice and diagnosis. Nevertheless, the majority of such approaches overlook the multi-scale nature of the WSIs; the few existing hierarchical MIL proposals simply flatten the multi-scale representations by concatenation or summation of feature vectors, neglecting the spatial structure of the WSI. Our work aims to unleash the full potential of pyramidally structured WSIs; to do so, we propose a graph-based multi-scale MIL approach, termed DAS-MIL, that exploits message passing to let information flow across multiple scales. By means of a knowledge distillation scheme, the alignment between the latent space representations at different resolutions is encouraged while preserving the diversity in the informative content. The effectiveness of the proposed framework is demonstrated on two well-known datasets, where we outperform SOTA on WSI classification, gaining a +1.9% AUC and +3.3% accuracy on the popular Camelyon16 benchmark. The source code is available at https://github.com/aimagelab/mil4wsi.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_24

SharedIt: https://rdcu.be/dnwcB

Link to the code repository

https://github.com/aimagelab/mil4wsi

Link to the dataset(s)

https://camelyon16.grand-challenge.org/Data/


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper proposes a graph-based message passing approach for integrating multi-scale information in MIL. They additionally incorporate consistency losses to improve performance across scales. The experiments and ablations show the usefulness of the proposed method and the loss functions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and covers relevant previous work. The local connectivity based graph for fusing information within and across scales is simple and intuitive. The use of consistency based losses for regularization is interesting. The ablations are good and show the relevance and sensitivity of various losses and hyper-params.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors mention their approach helps incorporate spatial connectivity in multi-res MIL. They should look to compare with methods like SETMIL (https://conferences.miccai.org/2022/papers/462-Paper0415.html) which allow doing that and incorporate transformers for aggregating information. It’s unclear how the graph-based message passing compares to more recent transformer-based feature aggregation (SETMIL, TransMIL) in terms of performance, memory, and compute, and the authors should discuss this comparison.

    In the results comparison, the authors mention the results were referenced from previous papers. But the TransMIL numbers seem different from those reported in the original paper. Secondly, a lot of the papers use custom splits for TCGA NSCLC, so the numbers from different methods aren’t comparable unless they are re-implemented and tested on the same set.

    It’s unclear how the predictions y_1 and y_2 from different resolutions are combined to generate the final slide-level prediction.

    For the knowledge distillation loss, why is the temperature term applied to both student and teacher outputs?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are on publicly available datasets and the authors promised to share the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper is well written and the key ideas and experiments are described well.

    But there are gaps in the way the evaluation was done and it would be good to see more discussion and comparison of the graph based message passing with recent transformer based approaches.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written with interesting ideas and relevant experiments.

    There are gaps in the evaluation which need to be addressed, but overall merits an accept

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper presents a novel method for integrating multi-scale patch information using graph networks for WSI classification. The paper includes a variety of experiments and comparisons.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • strong literature review
    • novel aspects in methodology to integrate multi-scale info
    • strong evaluation, with improved results demonstrated compared to SOTA
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • lack of discussion on the computational complexity/time vs the SOTA
    • lack of discussion on the limitations of the proposed work
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The relevant details are described clearly, the code has already been published.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Table 1 shows that the results on TCGA Lung are comparable for single-scale and DAS-MIL (with a slight improvement for the latter); however, there is a significant improvement on Camelyon16 with DAS-MIL compared to single-scale. This needs to be discussed in the paper: the results on TCGA Lung raise the question of whether a multi-scale method is needed.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented method is novel with a lot of experiments and comparisons included. The results are promising, evaluation is strong.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work introduces a novel multiple instance learning (MIL) method for whole slide image (WSI) classification that exploits the true pyramidal data structure of WSIs via graph neural networks (GNNs) and self-distillation across scales, i.e., a student-teacher learning strategy across scales. As opposed to existing multi-scale approaches, representations across scales are aligned via knowledge distillation learning objectives for both instance and bag classifier(s) per scale to enhance diversity and preserve information. By leveraging graph-based message passing across scales for patches within the same neighborhood, subsequent representations are more robust. Experimental results with several ablations reveal superior performance across all evaluation settings with significant margins on benchmark histopathology datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very clear, easy to follow and motivates the ideas in a coherent manner.

    • The authors address a very relevant task in histopathology image understanding. As opposed to many existing single-scale methods, I found the presented ideas very timely. The need for multi-scale approaches cannot be overstated.

    • The proposed approach is novel and interesting, including the learning objectives. While graph-based approaches are gaining popularity for this task, the authors have a unique perspective and formulation, especially the use of KD across scales.

    • Extensive experiments on benchmark datasets with ablations sufficiently support the proposed ideas.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Potentially missing comparison: The authors briefly discussed a related method [-] but did not include it in the evaluation. As both this work and the former leverage DINO in some fashion and encourage ‘representation consistency’ across scales (though technically, one is graph-based and the other uses transformers), it would be beneficial to better understand the significance of this work in that aspect.

    [-] Chen et al. “Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning”. CVPR 2022.

    • Discussion regarding inference strategy: The proposed method can scale to more than (M=2) scales; in the current design, the higher scale serves as the teacher while the rest serve as students. It is unclear how inference is performed. Is the highest scale employed, or was ensembling similar to [ref. 28 – Zhao et al.] used? I may have missed this part.
    • Sensitivity to hyper-parameters: The method appears sensitive to the hyper-parameters (Tables 2 and 3); it is unclear if this is isolated to the Camelyon16 (CM16) dataset. Did the authors observe similar behavior on TCGA? Notably, when $\alpha, \beta, \tau = 1.0$, accuracy improvements are marginal per scale on CM16.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code provided with all relevant implementation and experimental details. Fairly reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Minor suggestions:

    Add a color bar in Supplementary Figure 1 with a clearer color-space (red-blue).

    Qualitatively, are the predictions consistent (localization probabilities) across scales?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is novel and has a clear motivation, with strong empirical results supporting the designs. The extensive experiments across different settings provide a clear view of the utility of the method. Aside from the missing evaluation with a related method and needed clarifications on inference, I support acceptance of the paper in its current form.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All three reviewers appreciated the idea of using multi-scale patch information via graph networks and self-distillation for WSI classification. While the reviewers gave high scores (one weak accept and two strong accepts), they raised some questions and suggestions to improve the manuscript: the lack of discussion on computational complexity and limitations, some missing comparisons, and the need for discussion of the inference strategy. Please consider these suggestions to improve the paper when preparing the final version.




Author Feedback

We thank the reviewers for their time, expertise, and thoughtful reviews. Although the assessments are generally positive, we respond to and clarify the main concerns below. References marked with * are missing from the original paper.

  1. Computational Complexity [R1,R2]: Tab. 1 of the supplementary material compares the computational complexity and time requirements of a single-scale approach with and w/o the graph module (the latter represents DSMIL [18]), and the proposed DAS-MIL (numbers calculated on a single RTX2080). We exclude the overhead induced by the feature extractor. To provide a broader analysis, additional comparisons including more recent feature-based aggregation methods will be included:

HIPT [5]:      Train 15.90s - Test 8.65s - N. params 1.9 * 10^8 - Mem 783MB
TransMIL [23]: Train 11.93s - Test 5.08s - N. params 4 * 10^5 - Mem 1.6MB

As can be appreciated, our approach does not introduce a substantial overhead; moreover, considering the delicate nature of the task and the potential implications of misclassification, we advocate putting predictive capabilities first.

  2. Missing Comparisons [R2,R3]: Compared to [18], our method computes multi-resolutions simultaneously, not in a sequential fashion, which is instead subject to propagation errors between modules. HIPT [5] is similar to SETMIL [1*] since it uses a ViT with positional encoding. While [1*] directly incorporates the positional encoding in the attention mechanism, our method splits the problem into two distinct modules: a 2-tier GNN that applies a “masked attention” to integrate spatial context into the instance representations, which are then fed into a second attention module that applies a masked non-local operation. Unfortunately, SETMIL does not provide the training source code and releases models pre-trained on different datasets w.r.t. those we considered in our setup. This prevents us from fairly comparing with [1*]. HIPT performance is reported below.
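The cross-scale message passing idea can be illustrated with a minimal sketch. This is an illustrative assumption, not the authors' actual GNN: here each low-magnification patch node is linked to its spatially co-located high-magnification children, and one round of mean-aggregation message passing (with a residual update) mixes information across scales.

```python
import numpy as np

def message_passing(features, edges):
    """One round of mean-aggregation message passing (illustrative sketch).

    features: (N, D) array of patch embeddings; nodes from both scales
              live in the same graph.
    edges:    list of (src, dst) pairs; include both directions for
              undirected cross-scale links (low-res node <-> its children).
    """
    n, d = features.shape
    agg = np.zeros((n, d))
    deg = np.zeros(n)
    for src, dst in edges:
        agg[dst] += features[src]
        deg[dst] += 1
    deg[deg == 0] = 1  # isolated nodes keep their own feature
    # Residual-style update: own feature plus the mean of neighbor messages.
    return features + agg / deg[:, None]

# Toy graph: node 0 is a low-res patch, nodes 1-4 its four high-res children.
feats = np.eye(5)
edges = [(0, i) for i in range(1, 5)] + [(i, 0) for i in range(1, 5)]
out = message_passing(feats, edges)
```

After one round, the low-res node (0) has absorbed an equal share of each child's feature, and each child has absorbed the low-res context; stacking such rounds is the usual way depth is added in a GNN.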

  3. Evaluation [R2]: We apologize for any confusion regarding the evaluation of results and the use of numbers from different sources. To clarify, numbers are taken from two distinct sources [18,26], which both use the same train/test splits. In the revised manuscript, the caption of Tab. 1 will be rewritten to make this clear. For the sake of completeness, we also re-ran the experiments of most competitors ourselves (some are missing due to time constraints). The following numbers are all obtained using the same feature extractor, i.e., DINO (* identifies Camelyon16, ° TCGA-LUNG):

           Acc*   AUC*   Acc°   AUC°
Max-Pool.  0.893  0.899  0.851  0.909
Avg-Pool.  0.723  0.672  0.823  0.905
ABMIL      0.724  0.744  0.864  0.933
TransMIL   0.883  0.942  0.881  0.948
DSMIL      0.915  0.952  0.888  0.951
HIPT       0.898  0.951  0.890  0.950

  4. Clarifications [All]: Regarding the combination of predictions from different resolutions, we confirm that those obtained at the higher scale are used to generate the final slide-level prediction. In short, the model distills the knowledge from the higher scale to the lower one and, since the scales are connected by the GNN, this improves the prediction capabilities of the higher scale itself. For this reason, we refer to this process as self-knowledge distillation. This will be explicitly mentioned in the revised version of the paper.

Comparing C16 and TCGA datasets, there is a significant difference in the signal intensity. The former has roughly <10% of tumor tissue, while the latter has >80% of tumor regions per slide. In this sense, the contextualized representation provided by our DAS-MIL is much more effective in the first case.

Regarding the knowledge distillation loss, the temperature term is applied to both student and teacher outputs to ensure numerical stability in the distillation process.
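Applying the same temperature to both student and teacher logits is the standard Hinton-style distillation formulation; a minimal sketch follows (this is a generic KD loss, not necessarily the authors' exact objective; the `tau**2` factor is the conventional gradient rescaling):

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-scaled softmax; a higher tau flattens the distribution.
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, tau=2.0):
    # The SAME tau softens BOTH distributions so they are compared on the
    # same scale; tau**2 keeps gradient magnitudes comparable across taus.
    p_teacher = softmax(teacher_logits, tau)
    log_p_student = np.log(softmax(student_logits, tau))
    return -float(np.sum(p_teacher * log_p_student)) * tau**2
```

Because the loss is a cross-entropy between the two softened distributions, it is minimized exactly when the student's softened outputs match the teacher's, which is why scaling only one side would bias the target.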


