
Authors

Jacob Gildenblat, Anil Yüce, Samaneh Abbasi-Sureshjani, Konstanty Korski

Abstract

Weakly supervised classification of whole slide images (WSIs) in digital pathology typically involves making slide-level predictions by aggregating predictions from embeddings extracted from multiple individual tiles. However, these embeddings can fail to capture valuable information contained within the individual cells in each tile. Here we describe an embedding extraction method that combines tile-level embeddings with a cell-level embedding summary. We validated the method using four hematoxylin and eosin stained WSI classification tasks: human epidermal growth factor receptor 2 status and estrogen receptor status in primary breast cancer, breast cancer metastasis in lymph node tissue, and cell of origin classification in diffuse large B-cell lymphoma. For all tasks, the new method outperformed embedding extraction methods that did not include cell-level representations. Using the publicly available HEROHE Challenge data set, the method achieved a state-of-the-art performance of 90% area under the receiver operating characteristic curve. Additionally, we present a novel model explainability method that could identify cells associated with different classification groups, thus providing supplementary validation of the classification model. This deep learning approach has the potential to provide morphological insights that may improve understanding of complex underlying tumor pathologies.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_75

SharedIt: https://rdcu.be/dnwKy

Link to the code repository

N/A

Link to the dataset(s)

HEROHE ECDP2020: https://ecdp2020.grand-challenge.org/

National Cancer Institute GDC Data Portal: https://portal.gdc.cancer.gov/

CAMELYON17 Grand Challenge: https://camelyon17.grand-challenge.org/Data/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a novel method of combining patch- and cell-level embeddings for whole slide image classification. The authors apply their method to 4 classification tasks on 4 different datasets and show that combining cell and patch embeddings outperforms using patch embeddings alone, which was the prior norm. The authors also introduce an explainability method on top of their proposed architecture.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is nicely written and the concepts are explained with concision; there is the correct level of detail to follow the story and to appreciate the method in a few pages. There is good justification for the modelling decisions made across the board: the cell and patch embedding combination is well reasoned, and the explainability method seems to work. The authors show that using their combined embeddings they can achieve better performance, and that this is the case across all 4 datasets and 2 architectures. Figure 2 is very nice, although it would be better if it appeared earlier in the text. The authors show that their method is the new SOTA on a challenge, which is an important contribution.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main claim motivating this paper is that tile embeddings do not capture meaningful information about the cells in the tiles. This may be true, but I don't think there was enough evidence to back this claim up in the paper. It needs to be supported with references to other works that show this, with the authors' own experiments, or ideally both. I believe the only evidence in the paper is Figure 1, where the authors claim that the last layer does not focus on cells. However, the authors contradict themselves by showing that the second-to-last layer does focus on cells; it is not clear to me from this why current models do not capture cell-level features. Also, this example is only one patch out of potentially millions in the dataset; is there a way to summarize this across the dataset?

    The authors report performance results from only a single run with no cross-validation. It is encouraging that the combined embeddings always outperform the tile embeddings, but to strengthen this claim the authors should add cross-validation to quantify the variation, as in many cases the difference is small.

    What model was used for cell segmentation? Any details on this would be good.

    How were Xformer and A-MIL chosen? There are newer models that could perform better. Also, the Xformer reference points to "Attention Is All You Need"; there is surely a better reference for this.

    The proposed explainability method is interesting. However, the authors make biological claims about what the model focuses on (Section 4.1) that are not backed up by references; ideally a pathologist should also confirm them. As someone who is not an expert in biology, I cannot accept these claims without evidence or pathologist confirmation.

    Since the downstream MIL models have their own explainability/attention mechanisms, did the authors examine how those mechanisms agree or disagree with theirs? Why should we use the authors' new explainability method when these methods already exist?

    Can the described explainability method really be referred to as "attention"? Based on the authors' description, these are weights that are learned through training and fixed afterwards. The point of attention is that the weights vary at inference time and depend on the inputs, not only during training; does the authors' method achieve this? Perhaps I have misunderstood; some clarification would be good. The text also says the weights are initialized to 1, and elsewhere that they are initialized to 0, which is not clear.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    ok, if the code is public

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses listed above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See the first weakness above: the central premise that tile embeddings do not capture meaningful information about the cells in the tiles is not sufficiently evidenced in the paper.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose training a nuclei-level CNN with contrastive representation learning, alongside the common patch-level CNN, to obtain better feature embeddings of histopathology patch images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – A novel histopathology image feature extraction approach. This article describes a plug-and-play embedding extraction method that combines tile-level and cell-level embeddings, which improves model accuracy and provides interpretability at the cellular level.
    – Evaluation on multiple datasets. The method was evaluated on three public datasets and one in-house dataset.
    – The paper is well written and easy to read.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – Lack of comparison with other patch embedding methods. The main contribution of the method lies in its plug-and-play nature and its cell-level interpretability, but it lacks comparison with other advanced embedding methods that introduce the cellular level, such as HACT [1] and SHGNN [2].
    – The proposed method is time- and computation-consuming. Section 2.2 mentions that all the nuclei in a tile are sent to the cell-level ResNet. The computation and memory cost will be tens of times that of common feature extraction methods.
    – The experimental results on CAMELYON16 are less convincing. As reported in [3], CLAM and TransMIL can reach an AUC of about 0.90 without introducing cell-level embeddings.

    [1] Pati, P., et al.: Hierarchical graph representations in digital pathology. Medical Image Analysis, 75, 102264 (2022). https://doi.org/10.1016/j.media.2021.102264
    [2] Hou, W., Huang, H., Peng, Q., Yu, R., Yu, L., Wang, L.: Spatial-Hierarchical Graph Neural Network with Dynamic Structure Learning for Histological Image Classification. In: MICCAI 2022, pp. 181-191. https://doi.org/10.1007/978-3-031-16434-7_18
    [3] Tourniaire, P., Ilie, M., Hofman, P., Ayache, N., Delingette, H.: MS-CLAM: Mixed supervision for the classification and localization of tumors in whole slide images. Medical Image Analysis, 85, 102763 (2023). https://doi.org/10.1016/j.media.2023.102763

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments were conducted on publicly available datasets. The authors did not state that the code will be released. However, the described algorithm is easy to implement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    – Perhaps providing a comparison with other methods that introduce cell-level features would highlight the superiority of the method, as this article's method has its unique characteristics (plug and play, interpretability at the cellular level).
    – Section 4.1 mentions a cellular average embedding re-formulation where w_i is confusing; please provide more detail about w_i, since your method does not introduce an attention module.
    – I suggest the authors develop an integrated network to extract a global level of nuclei features in a tile, which would have a consistent computation and memory cost for all tiles. That would be more valuable and significant for histopathology image feature embedding.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    – The approach to extracting cell-level features is time- and computation-consuming, which is not appropriate for building clinical applications.
    – The experimental comparison is insufficient.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present a method for learning features from whole-slide images (WSIs) using both tile-level and cell-level information. The approach can be easily integrated into existing multiple-instance learning (MIL) frameworks. The authors also contribute a model explainability method which measures the correspondence between the model activations and the cells. The presented method is easy to use with existing frameworks, improves performance on various WSI classification tasks across four datasets, and offers new insights into the links between cellular morphology and disease biology.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    [Clarity]

    • The manuscript is clearly written and easy to follow.
    • Authors provide good explanations for design choices which are well grounded and easy to understand.
    • Figures are clear with defined notations and are informative.

    [Simplicity]

    • The presented method is simple and thus can be integrated easily into many frameworks working on WSIs.
    • The explainability method is also well motivated and easy to implement yet effective.

    [Strong Findings & A Potential for Large Impact]

    • The contribution of the combined embeddings is clear in terms of performance improvements and is delivered effectively. The gains generalize across multiple datasets.
    • Cellular explainability method can be applied in general and may have large impact on the field of digital pathology in discovering new insights for computer vision driven biomarkers for cancer treatment.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    [Additional ablations]

    • Cell-level embedding algorithm: ablation studies measuring the impact of blacking out the non-nuclei region and of the modified GAP are missing.
    • In Section 2.3, the authors make a design choice to include only tiles with >= 10 cells. It makes intuitive sense, but some justification for this choice (either a reference or an ablation study) would be good to have.

    [Evaluation]

    • I suggest the authors add confidence intervals to the experimental results in Figures 4-5.

    These are not major weaknesses, as these details do not detract from the main narrative of the paper. The additional ablation results and a more thorough evaluation with confidence bounds would make the experimental results stronger.
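For context, the bias-corrected and accelerated (BCa) bootstrap interval the authors later adopt in their rebuttal can be sketched with the Python standard library alone. Everything below is illustrative: the data are a made-up vector of per-slide correctness indicators, not results from the paper, and the helper is a textbook implementation, not the authors' code.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated (BCa) bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)

    # Bootstrap replicates of the statistic, sorted for quantile lookup.
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))

    # Bias correction: how often the replicates fall below the point estimate.
    nd = NormalDist()
    prop = sum(b < theta_hat for b in boots) / n_boot
    prop = min(max(prop, 1.0 / n_boot), 1.0 - 1.0 / n_boot)  # keep inv_cdf finite
    z0 = nd.inv_cdf(prop)

    # Acceleration: skewness of the jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adj_quantile(q):
        # Shift the nominal quantile by the bias and acceleration corrections.
        z = nd.inv_cdf(q)
        p = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        return boots[min(int(p * n_boot), n_boot - 1)]

    return adj_quantile(alpha / 2), adj_quantile(1.0 - alpha / 2)

# Illustrative data: per-slide correctness from a single evaluation run
# (82 of 100 slides classified correctly).
correct = [1] * 82 + [0] * 18
lo, hi = bca_interval(correct, mean)
```

BCa adjusts the plain percentile interval for bias (via z0) and for skewness (via the jackknife acceleration a), which matters for bounded metrics such as accuracy or AUC close to 1.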

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Given the simplicity of the presented method, I believe it can be reproduced without much problem.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Most of these comments may be useful for extending this work into a follow-up high-impact journal paper for venues such as Nature Communications, MedIA, etc. Many of these suggestions will improve the paper, but I think they are beyond the scope of this MICCAI submission, which is already well put together with a clear narrative and strong experimental results.

    [Evaluation]

    • The suggestion to include confidence intervals comes from the motivation to accentuate the effectiveness of the presented approach. Readers should be able to see clearly that the improvements from the proposed method are statistically significant over the baseline.
    • The suggested ablations would make the experimental results much more comprehensive. However, I understand that MICCAI offers limited space and such details may have been omitted due to space constraints.

    [The pretrained backbone]

    • I strongly agree with the authors' statement in Section 2 that "using backbones pretrained on natural images is not optimal". The choice of a ResNet50 pretrained with an SSL method such as BYOL makes sense and is a good baseline, but I think it can be improved further to get more performance out of the presented framework. I suggest the authors refer to a very recent paper [1], to be published at CVPR 2023, benchmarking various SSL methods for digital pathology. One of the biggest strengths of this MICCAI work is its 'plug-and-play' nature, so I am curious to see the performance improvements from plugging in a stronger SSL-pretrained backbone directly from [1] or from adopting the SSL insights it offers.

    [1] Benchmarking Self-Supervised Learning on Diverse Pathology Datasets. Mingu Kang, Heon Song, Seonwook Park, Donggeun Yoo and Sérgio Pereira. CVPR 2023.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a very well executed paper with clear results and potentially large impact to the field of digital pathology in general. I view the impact as two-fold:

    • Technical contribution: the method is simple and can be integrated into existing multiple-instance learning frameworks. Given that the implementation is straightforward (hopefully the code is released :D ), it can provide large benefits to researchers in this domain.

    • Clinical contribution: the explainability model can provide new insights to researchers in oncology/pathology and may even lead to new discoveries in the field of AI-driven biomarkers.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work introduces a method of combining patch- and cell-level embeddings for whole slide image classification. The approach was applied to 4 classification tasks on 4 different datasets and can be easily integrated into existing multiple-instance learning frameworks. The authors also contribute a model explainability method. The cellular explainability method can be applied in general and may have a large impact on the field of digital pathology by discovering new insights for computer-vision-driven biomarkers for cancer treatment. The paper is, in general, well written and clearly presents the concepts, methods, and results, and the authors provide good explanations for design choices which are well grounded and easy to understand.

    However, there are several concerns with the paper, and the authors should address them to improve its clarity and presentation. The authors should provide theoretical, analytical, or practical evidence to support their claims: they assume that tile embeddings do not capture meaningful information about the cells in the tiles without providing any references or clues, and Fig. 1 may contradict this claim. The explainability method is interesting; however, the authors make biological claims about what the model focuses on that are not explained or supported very well, and there is also a concern about the technical aspect of the explainability method. Moreover, the authors should improve the citations and the discussion of related work.




Author Feedback

We thank the reviewers for providing comments. Below we have summarized their most significant points and provided replies that we hope will be sufficient to allay their concerns.

Comment 1: The evidence to support the claim that tile-level embeddings do not capture cell-level features is not strong.
Response: Owing to space limits we could not provide further evidence in the original submission. We used a benchmarking method to quantify the ability of embeddings to capture cell-level features. We randomly sampled 5000 tiles from the cell of origin dataset and, for each tile, measured the average values for cellular features previously described in the literature, and confirmed by our pathologist, to have predictive value: the nuclei area and the mean and standard deviation of nuclei intensity. Using a random forest regressor to predict those features, combined embeddings achieved mean R-squared values of 0.96, 0.97 and 0.79, versus 0.80, 0.87 and 0.69 with tile-only embeddings. These experiments can be added if our paper is accepted.

C2: Performance results lack a measure of variability.
R: Confidence intervals, obtained through BCa bootstrapping, will be added to our data to demonstrate the variability in model performance for each embedding scheme on each dataset. The generalizability of our combined embedding approach is shown by the improved model performance versus tile-only embeddings across four datasets and two different MIL model architectures.

C3: There is a lack of comparison with other patch embedding methods.
R: HACT and SHGNN use GNNs to model the tissue microenvironment by learning information from within and between features in images (e.g. nuclei or regions). Similarly, our combined embedding approach captures information at different scales (cell and tile level) that can be extended to multiple scales. The advantage of our approach is that it is a plug-and-play enhancement for all existing tile-based methods, unlike GNNs, which have dramatically different and more complex architectures. Furthermore, using A-MIL or Xformer with our embedding approach automatically models spatial and hierarchical interactions among features in images. The only overhead of our approach is the additional computational time for cell segmentation, which accounted for an extra ~50% runtime in our benchmark runs.

C4: The experimental results on CAMELYON16 are not convincing compared with CLAM and TransMIL, which achieve AUC = 0.9 without cell-level embeddings.
R: It was not our intention to claim state-of-the-art results on the CAMELYON16 dataset. The aim of our paper was not to benchmark classification performance for different MIL methods but rather to demonstrate an embedding extraction method that could be applied to improve the performance of many weakly supervised methods, including CLAM and TransMIL. To that end we focus on presenting improvements in two of the most used methods: A-MIL and Xformer. If positional encoding is excluded, TransMIL is like Xformer and achieves a slide-level classification AUC of 0.84 on the CAMELYON16 dataset. This compares with an AUC of 0.86 with Xformer using our combined embedding extraction method.

C5: The description of the attention weights in the explainability model is not clear.
R: We agree that this could be explained more clearly. The weights are not learned through training. The gradients of the weights per cell (Wi) are calculated at inference time to obtain the per-cell contribution. Wi are initialized to 0, but the weights per embedding are initialized to 1/N, resulting in the average of all embeddings as initialization. We can add this information if our paper is accepted.

C6: A pathologist should confirm the biological claims regarding identified features in the explainability model.
R: The cellular features identified by our model for each tumour type were validated by a trained expert pathologist who is a co-author of this paper. We can add this information if our paper is accepted.
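The weighting scheme described in the response to C5 can be illustrated numerically. The sketch below is not the authors' code: it uses a hypothetical linear score head in place of the real downstream MIL model and random vectors as cell embeddings, to show why the gradient of the score with respect to the zero-initialized per-cell weights w_i recovers each cell's contribution to the averaged embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                  # cells per tile and embedding dim (illustrative)
E = rng.normal(size=(N, d))  # per-cell embeddings e_i
v = rng.normal(size=d)       # hypothetical linear score head, stand-in for the MIL model

w = np.zeros(N)              # per-cell weights w_i, fixed at 0 as in the rebuttal
weights = 1.0 / N + w        # per-embedding weights initialized to 1/N
summary = (weights[:, None] * E).sum(axis=0)  # equals the plain average when w = 0
score = float(v @ summary)

# The per-cell contribution is d(score)/d(w_i); for a linear head this is v @ e_i.
contrib = E @ v

# Sanity check with a finite difference on w_0.
eps = 1e-6
w_pert = w.copy()
w_pert[0] += eps
score_pert = float(v @ (((1.0 / N + w_pert)[:, None] * E).sum(axis=0)))
fd_grad = (score_pert - score) / eps
```

With w fixed at 0 the summary reduces to the plain average of the cell embeddings, and the finite-difference gradient matches the analytic per-cell contribution; in the paper's setting the same gradients would presumably be obtained by backpropagation through the full model at inference time.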




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors introduce deep cellular embeddings that combine patch and cell level information for whole slide image classification. The rebuttal addressed most of reviewers’ major concerns. However, the explainability method and its findings are not sufficiently explained, which should be further improved.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a novel approach that uses a nuclei-level CNN and contrastive representation learning to enhance patch feature embeddings in histopathology images.

    During the first round of review, the reviewers appreciated the novel insight of incorporating cellular-level contrastive learning, the high quality and clarity of the writing, and the superior performance achieved. However, concerns were raised regarding the strength of the claim that tile-level embeddings fail to capture cell-level features, the lack of a measure of variability, the absence of comparison with other patch embedding methods, and the perceived lack of rigor in the performance evaluation.

    In response to these concerns, the authors provided a comprehensive rebuttal summarizing and addressing the raised issues. However, despite the rebuttal, the paper received two negative reviews and one positive review.

    While I find the idea of incorporating cellular-level contrastive learning interesting, I believe there are more straightforward approaches to achieve this. For instance, the model could incorporate multi-scale images for contrastive learning, such as patches from both 5x and 40x (or even higher) magnifications simultaneously. Such multi-scale methods have been explored in the medical imaging domain without the need for cell identification. Additionally, masking out the background information surrounding the cells may result in the loss of important clinical features. Unfortunately, these aspects are not assessed in the paper.

    Based on these reasons, my recommendation leans towards rejection.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The responses of the authors during the rebuttal process are not convincing enough. We hope that the constructive remarks will help improve the work for any future submission.


