
Authors

Zhilong Lv, Rui Yan, Yuexiao Lin, Ying Wang, Fa Zhang

Abstract

Microsatellite instability (MSI) is a crucial biomarker for clinical immunotherapy in gastrointestinal cancer, but additional immunohistochemical or genetic tests for MSI are often unavailable due to a lack of medical resources. Deep learning has achieved promising performance in detecting MSI from hematoxylin and eosin (H&E) stained histopathology slides. However, these methods are primarily based on patch-supervised slide-label models that aggregate patch-level results into a slide-level result, which leads to unstable predictions due to noisy patches and the choice of aggregation method. In this paper, we propose a joint region-attention and multi-scale transformer (RAMST) network for microsatellite instability detection from whole slide images in gastrointestinal cancer. Specifically, we present a region-attention mechanism and a feature weight uniform sampling (FWUS) method to learn a representative subset of image patches from whole slide images. Moreover, we introduce the transformer architecture to fuse multi-scale histopathology features, consisting of patch-level and region-level features, to characterize the whole slide images for slide-level MSI detection. Compared to existing MSI detection methods, the proposed RAMST shows the best performance on the colorectal and stomach cancer datasets from The Cancer Genome Atlas (TCGA) and provides an effective feature representation learning method for WSI-label tasks.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_29

SharedIt: https://rdcu.be/cVRrM

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a new transformer-based approach for MSI detection, which is a WSI MIL problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The framework is reasonable. The result is promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed method is quite complicated, and details of a critical step, FWUS, are missing. I cannot find a description of how the FWUS scores are generated. Since a transformer makes no assumption about the correlation between instances, one major problem with adopting it is the demand for a large amount of rich training data. However, for this task, the amount of training data seems to be limited to a few hundred or a few thousand samples. Thus I am a little skeptical of the results.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be made available. The experiments are based on a public dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It would be nice if the authors could visualize the token attentions on top of the WSI to show what the transformer attends to. Cropping patches from image regions and extracting CNN features is quite intuitive and standard. I would suggest the authors simplify this part of the method and give more details on other aspects, such as how E_pos is designed: is it in WSI space or region image space? Is E_pos different between x20, x5, and thumbnail? If so, how? Section 2.2 largely duplicates Section 2.1. It can be simplified or combined with 2.1 to make more room for other, more critical content.

    Minor: Some of the references seem to be missing in the paper, for example: … state-of-the-art method DeepSMILE [] by 0.18%. … MSI detection methods for comparison [].

    Typos: … where c is the number of feature map channels..

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Reasonable method with some novelty and performance gain. Some details are missing but can be fixed.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes a method to detect microsatellite instability from H&E-stained whole slide images, as well as from regions. The method uses an attention map to sample patches for greater predictive power and uses a transformer architecture to combine two levels of information, region level and patch level, in the form of extracted features. The region-level architecture is leveraged to build a slide-level architecture by aggregating regions in a modified architecture.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper uses powerful ML techniques such as attention and transformer architectures to solve a challenging and relevant problem. The proposed architecture is sound and novel and makes valuable contributions. The sampling method using the attention map sounds like a good idea and is shown to perform well. The combination of region-level and patch-level features using transformers is also a good idea and seems to perform well for this task. The combination of auxiliary and primary losses is a positive addition and is convincingly explained. The aggregation of regions in the slide-level architecture is also very valuable.
    The results show a superior performance compared to state of the art methods for MSI. The ablation study is convincing and shows the superiority of using slide level and the use of the sampling technique with the attention map. The paper is well written and the main concepts reasonably well explained. Fig 1 is very informative and informs visually very well about how the method is designed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Since my theoretical knowledge of transformers is not excellent, I find the details given in the paper regarding their use insufficient. For example, it is unclear to me what the positional embeddings are and what role the class token plays. Either a brief explanation or some references would be appreciated.

    The use of the thumbnail image in the whole-slide-level architecture is unclear. The magnification level of that image is not given.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The transformer implementation and the classifiers used, as well as the other modules shown in Fig 1 (MLP, CNN, attention module), are not well detailed in the paper or supplementary material. However, this is less relevant since the authors are providing the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I appreciate the work done by authors in this paper, and I think it is in general a solid work.

    The authors should better clarify the classification goal of the method (number of classes): my understanding is that it is a binary classification problem, positive or negative for MSI, which I believe is not clearly stated in the paper.

    There are repetitive sentences in the paper that can be removed. For example, the short summary of RAMST is repeated in the abstract, the introduction (page 2, last paragraph), the introduction again (page 3, second paragraph), and the last paragraph of the method section.

    Page 4, line before Eq. (1): M'_{fwus} should be X^P_{fwus}.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend acceptance of the paper given the novelty of the method, the convincing results and the proper explanations in the text. The weaknesses I see mainly relate to a lack of detail in some sections, so I see far more strengths than weaknesses, and I think the paper adds value to the MICCAI community.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    A transformer-based model was proposed to detect microsatellite instability status from whole slide images, which outperforms existing patch-supervision methods on the gastrointestinal cancer dataset from TCGA. To preserve representative features and remove noisy and redundant data, a feature weight sampling method was proposed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Well motivated proposed approach, good analysis of the problem at hand, latest deep learning based approach to solve problem, careful experiments and clear analysis of the results, comparison with state-of-the-art methods are given.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Nothing

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Datasets and experimental setup are clearly mentioned.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Nothing

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Main strengths of the paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors proposed RAMST, a joint region-attention and multi-scale transformer network for MSI status prediction of whole-slide images in gastrointestinal cancer. Specifically, a feature weight uniform sampling method is used to learn representative features of image regions and a transformer architecture is used to fuse the region-level tissue features with patch-level cell features. The proposed method outperformed existing MSI detection methods on the colorectal and stomach cancer datasets from TCGA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of the attention mechanism and sampling strategy in this paper is quite novel. It is essential for learning with WSI data, as it allows the selection of representative regions and patches within the large WSIs for effective predictions. Moreover, the use of transformer encoders on multi-level features is quite novel as well, and it allows effective exploration of the correlations between multiple same-level and multi-level features.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the proposed method is quite novel and the paper is also clearly written, some parts of the results section could be further improved as listed in the detailed comments below.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is good. The authors provided necessary details from the reproducibility checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The setup of training, validation, and testing data needs further clarification. For example, are the frozen slides from the 206 patients in the test dataset also used in training or validation? Do the compared methods in Table 1 all use the same training and testing data split? Ideally, we would want to apply the compared methods to the same set of training, validation, and testing data for a fair comparison. It would also be better if the authors could fill in the results for the different compared methods on the STAD dataset in Table 1.

    From Fig. 2, it seems that choosing an appropriate sampling rate is very important to the performance of the proposed algorithm. From Fig. 2 and Fig. S3, it seems that a 25% sampling rate gives the best result. Would it be a good default parameter choice for other researchers who want to use the proposed method?

    One advantage of the proposed method is the ability to sample from the attention weights of patches in order to eliminate noisy and redundant patches. Could the authors provide some examples and visualizations to compare the patches with high weights and low weights in order for the readers to understand the proposed method more intuitively?

    Minor: It seems that the citations for the compared methods were lost in the 5th and 10th line of the “Performance Evaluation” section. It would be better if the authors could also include the training time needed for different compared methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is quite novel as it combines attention-based sampling, to utilize representative parts of the WSIs, with transformers that allow effective integration of multi-scale features. The paper itself is also clearly written and the authors provided detailed results, especially on the analysis of the choice of sampling rate in ablation studies.

  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposed RAMST, a joint region-attention and multi-scale transformer network for MSI status prediction of whole-slide images in gastrointestinal cancer. The paper proposed a few novel components, such as the attention mechanism, sampling strategy and the use of transformer encoders on multi-level features. The paper is well organized and the main concepts are reasonably well explained. The ablation experiments are convincing and show superior performance.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We sincerely thank the Area Chair and reviewers for the valuable comments. Meanwhile, we are excited to contribute our work to the MICCAI community. The clarifications for the comments are detailed as follows:

  1. (R1) The proposed method is quite complicated and details of FWUS are missing: Thanks for the reviewer’s comments. As stated in the first two paragraphs of Section 2.1, the feature weight uniform sampling (FWUS) method selects patches by uniform sampling based on the ranked feature weight map, which is derived by channel pooling from the region-level feature map extracted by the CNN.

  2. (R1) The amount of training data for the proposed method seems to be limited: Thanks for the reviewer’s comments. The proposed region-attention and multi-scale transformer (RAMST) is a hybrid-architecture model, which can leverage the strengths of different architectures as well as reduce model complexity. Specifically, we used pre-trained CNNs for feature extraction and thus used fewer transformer encoders, which are responsible for feature fusion. As shown in supplementary material Table S1, we trained the region-level RAMST on nearly 190,000 images and thus obtained a well-trained region-attention module and patch-level feature extraction module. Then, we fine-tuned the WSI-level RAMST via transfer learning on WSI slides, where the transformer was pre-trained on randomly sampled patch images.

  3. (R1) The details on other aspects such as how E_pos is designed: Thanks for the reviewer’s comments. We used two-level positional embeddings E_pos=[loc_region, loc_patch], where loc_region indicates the position of the region and loc_patch indicates the 2-D position of the sampled patch within the region.

  4. (R1, R3, R4) Repetition and Typos: Thanks for the reviewers’ comments. We will revise repetitive sentences and fix errors in the camera-ready paper.

  5. (R1, R4) Visualizations of region attention: Thanks for the reviewers’ comments. We will add them in a future journal extension of this work.

  6. (R3) The authors should clarify better the classification goal of the method (number of classes): Thanks for the reviewer’s comments. As pointed out, the method addresses a binary classification problem: microsatellite instability (MSI) versus microsatellite stability (MSS). We will state the classification problem clearly in the camera-ready paper.

  7. (R3) Supplement for transformers and thumbnail of WSI: Thanks for the reviewer’s comments. The transformer was proposed in 2017 by the Google Brain team for natural language processing (NLP) tasks [Ref_13], and the Vision Transformer (ViT) [Ref_14] is the most representative transformer work in the field of computer vision. The positional embeddings add relative or absolute position information about the tokens to the input sequence [Ref_13], and the class token is used as the aggregate sequence representation for classification tasks [Ref_BERT].

For the whole slide images in SVS file format, the first image in an SVS file is the baseline image (full resolution) and the second image is a thumbnail.

Ref_BERT: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT (1) 2019: 4171-4186
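
The FWUS description in response 1 (channel pooling, ranking, then uniform sampling) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes mean pooling over channels and evenly spaced picks along the ranking, neither of which is specified in the feedback, and the names `channel_pool` and `fwus_select` are hypothetical.

```python
from typing import List, Tuple

# Region-level feature map indexed as [channel][row][col] (assumed layout).
FeatureMap = List[List[List[float]]]


def channel_pool(feature_map: FeatureMap) -> List[List[float]]:
    """Average over channels to get one weight per spatial location.

    Mean pooling is an assumption; the feedback only says "channel pooling".
    """
    n_channels = len(feature_map)
    n_rows, n_cols = len(feature_map[0]), len(feature_map[0][0])
    return [
        [sum(feature_map[c][r][k] for c in range(n_channels)) / n_channels
         for k in range(n_cols)]
        for r in range(n_rows)
    ]


def fwus_select(feature_map: FeatureMap, rate: float) -> List[Tuple[int, int]]:
    """Pick a fraction `rate` of locations, evenly spaced along the weight ranking."""
    weights = channel_pool(feature_map)
    # Rank all (row, col) locations from highest to lowest pooled weight.
    ranked = sorted(
        ((r, k) for r in range(len(weights)) for k in range(len(weights[0]))),
        key=lambda rc: weights[rc[0]][rc[1]],
        reverse=True,
    )
    n_keep = max(1, min(len(ranked), round(rate * len(ranked))))
    step = len(ranked) / n_keep
    # Evenly spaced picks keep both high- and low-weight locations represented.
    return [ranked[int(i * step)] for i in range(n_keep)]
```

Under these assumptions, each returned `(row, col)` pair marks where a patch would be cropped from the region, which is also the kind of 2-D position that loc_patch in response 3 refers to.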

  1. (R4) The setup of training, validation, and testing data needs further clarification. In the first deep learning-based MSI detection study in gastrointestinal cancer, Kather et al. provided a patient-level split into a training set and a testing set [Ref_4] (https://zenodo.org/record/2530835 and https://zenodo.org/record/2532612). Therefore, subsequent MSI detection models, including ours, are based on the same patient-level split dataset.
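
A patient-level split means that no patient contributes slides to both the training and the testing set, which is the property R4 asks about. A minimal sketch of such a check follows. It assumes that each slide identifier begins with a TCGA-style barcode whose first three dash-separated fields identify the patient; the helper names and the example identifiers are hypothetical and are not taken from the Zenodo archives.

```python
from typing import Iterable, Set


def patient_id(slide_name: str) -> str:
    """Return the patient part of a TCGA-style barcode (first three fields).

    Assumes the name starts with the barcode, e.g. "TCGA-XX-0001-...".
    """
    return "-".join(slide_name.split("-")[:3])


def shared_patients(train: Iterable[str], test: Iterable[str]) -> Set[str]:
    """Patients appearing in both splits; empty for a true patient-level split."""
    return {patient_id(s) for s in train} & {patient_id(s) for s in test}


# Hypothetical identifiers, for illustration only.
train_slides = ["TCGA-XX-0001-01Z-00-DX1", "TCGA-XX-0002-01Z-00-DX1"]
test_slides = ["TCGA-XX-0003-01Z-00-DX1"]
leaked = shared_patients(train_slides, test_slides)
```

Here `leaked` is empty. Adding a second slide from patient `TCGA-XX-0001` to `test_slides`, for example a frozen section, would make the check return that patient ID.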


