
Authors

Ting Jin, Xingran Xie, Renjie Wan, Qingli Li, Yan Wang

Abstract

Histology analysis of the tumor micro-environment integrated with genomic assays is the gold standard for most cancers in modern medicine. This paper proposes a Gene-induced Multimodal Pre-training (GiMP) framework, which jointly incorporates genomics and Whole Slide Images (WSIs) for classification tasks. Our work aims at dealing with the main challenges of multi-modality image-omic classification w.r.t. (1) the patient-level feature extraction difficulties from gigapixel WSIs and tens of thousands of genes, and (2) effective fusion considering high-order relevance modeling. Concretely, we first propose a group multi-head self-attention gene encoder to capture global structured features in gene expression cohorts. We design a masked patch modeling paradigm (MPM) to capture the latent pathological characteristics of different tissues. The mask strategy is randomly masking a fixed-length contiguous subsequence of patch embeddings of a WSI. Finally, we combine the classification tokens of paired modalities and propose a triplet learning module to learn high-order relevance and discriminative patient-level information. After pre-training, a simple fine-tuning can be adopted to obtain the classification results. Experimental results on the TCGA dataset show the superiority of our network architectures and our pre-training framework, achieving 99.47% in accuracy for image-omic classification. The code is publicly available at https://github.com/huangwudiduan/GIMP.



Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_49

SharedIt: https://rdcu.be/dnwJ4

Link to the code repository

https://github.com/huangwudiduan/GIMP

Link to the dataset(s)

https://portal.gdc.cancer.gov/


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors present a multi-modal pre-training approach that considers both WSIs and genomic information. They show on a binary TCIA classification task that multi-modal integration works, that pretraining enhances performance, and that their approach is better than previous ones.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Addresses the problem of multi-modal data integration in an elegant and innovative way.
    • Well written
    • Method well described.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No strong weaknesses, but:

    • I am curious to see how ‘Genomic’ alone performs in Table 1
    • Instead of four digits in the Tables, I would prefer mean and std.
    • I do not get why and how you generate triplets from a single WSI/omics pair. Can you describe the approach in simple terms?
    • Explainability: Can you check if the genes with high attention are meaningful for the classification problem at hand?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • TCIA data is open
    • Code is not provided
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Abstract:
      • please simplify ‘Then, we design a masked patch modeling paradigm that masks random patch embeddings from a fixed-length contiguous subsequence of a WSI to capture the latent pathological characteristics of different tissues’
      • superiority compared to what?
    • Intro
      • Pls provide more information in the fig 1 caption, eg about the abbreviations used.
    • Experiments: Pls motivate your ablation study.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    good read, important problem, novel approach

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors propose a new framework to incorporate the gene expression cohort via multi-head self-attention. A new multi-modal fusion strategy based on triplet loss is proposed to fuse gene and WSI features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -First to include the whole gene expression cohort and fuse with WSI data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -Weak evaluation. The authors only evaluated their method on one subtyping task (LUAD/LUSC), which is weaker than most works in the field (for example, [17] and [4] in the references). Also, they do not perform cross-validation, which is less rigorous than [17] and [4], which at least use 5-fold cross-validation.

    -Lack of interpretability: It would be interesting to see which gene groups are helpful for subtyping and how the gene expression data are correlated to path in WSI. Since the model is proposed in a medical domain, I feel that a certain level of transparency is necessary.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It’s good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Evaluate the proposed method on more subtypes, using a more rigorous approach such as X-fold cross-validation. There is also room to improve model transparency.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Most of my points are mentioned in the weaknesses part. The lack of a solid experiment design makes me lean toward reject.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose their GiMP pipeline, which uses both genomics and H&E WSIs, with lung cancer classification as the use case. The proposed work has clinical potential: it uses the two most commonly used data modalities in the clinic, genomics and biopsy slides, to build an AI-based automatic classification tool.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The image masking techniques applied could also help improve the representations learned from WSIs and could address the missing-modality challenge to some degree.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weakness. One minor point that could be taken into consideration is the computational time needed to train such a multimodal pipeline. The other minor point is that, while a comparison of the proposed pipeline with unimodal approaches has been provided, the paper could also include a comparison with conventional simple fusion approaches, such as early or late fusion, which would better show the value of the proposed architecture.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors did include the dataset information, training computational infrastructure, together with the training hyper-parameters. No significant concern regarding the reproducibility of the proposed work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors propose their GiMP pipeline, which uses both genomics and H&E WSIs, with lung cancer classification as the use case. The proposed work has clinical potential: it uses the two most commonly used data modalities in the clinic, genomics and biopsy slides, to build an AI-based automatic classification tool. The image masking techniques applied could also help improve the representations learned from WSIs and could address the missing-modality challenge to some degree. One minor concern is the training effectiveness of such a multimodal fusion architecture on two large-modality datasets: the authors did not report the training time, which is an important factor to consider when deploying such a pipeline in clinical practice.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The detailed illustration of the pipeline, the innovations of the proposed multimodality pipeline.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a multi-modal pre-training approach that considers both WSIs and genomic information and evaluates it on a binary TCIA classification task. All reviewers appreciate the importance of the paper in addressing the problem of multi-modal data integration, and the first two reviewers (R2 and R3) found no significant weakness in the paper. The last reviewer feels the evaluation of the method is a bit weak (only one task, not as many as in [4] and [17]). But of these two references, one is about radiology images and therefore not a direct comparison ([17]), and the other is a journal paper, which usually has much more space for in-depth evaluation. Other weaknesses mentioned by reviewers, such as the attention of genes and the cross-attention of genes on WSIs, would indeed be interesting to see, or somewhat necessary if this were a journal submission. In my opinion, this is still a strong paper with minor weaknesses, and I therefore recommend early acceptance.




Author Feedback

We thank all reviewers and the AC for constructive feedback. In the following, we will address the major concerns one by one.

1) Genomic data: @R2 is curious to see how genomic data alone performs in Table 1 and asks whether the genes with high attention are meaningful for pathological classification. GroupMSA using genomic data alone achieves 93.12% accuracy. In addition, when we discard the genomic groups whose attention scores are in the top half, GroupMSA drops from 93.12% to 87.89% ACC (a random drop leads to 91.58% ACC), while the overall framework GiMP drops to 95.79%. Regarding @R4's question of which genomic groups are helpful for subtyping and how the gene expression data are correlated to WSIs, we will explore an in-depth evaluation from a clinical perspective in future work.
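The drop experiment described in 1) can be illustrated with a minimal sketch. It assumes one scalar attention score per genomic group, that an odd number of groups rounds the dropped half down, and that ties are broken by group index; the helper names `drop_top_half` and `drop_random_half` are hypothetical and are not taken from the GiMP code.

```python
import random


def drop_top_half(scores):
    """Return indices of the groups kept after discarding the top half by score.

    `scores` holds one attention score per genomic group (an assumption).
    """
    # Rank group indices from highest to lowest attention score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[: len(scores) // 2])
    return [i for i in range(len(scores)) if i not in dropped]


def drop_random_half(num_groups, rng=None):
    """Random-drop baseline: discard half of the groups chosen uniformly."""
    rng = rng or random.Random()
    dropped = set(rng.sample(range(num_groups), num_groups // 2))
    return [i for i in range(num_groups) if i not in dropped]


kept = drop_top_half([0.1, 0.9, 0.4, 0.7])  # groups 1 and 3 are discarded
```

Comparing a model evaluated on the groups kept by `drop_top_half` against one evaluated on the groups kept by `drop_random_half` mirrors the comparison between the attention-guided drop and the random drop reported above.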

2) Evaluation: @R4 points out that we only evaluated our method on the TCGA-NSCLC dataset. Gastric cancer (GC) is one of the leading causes of cancer-related mortality worldwide, and gene expression is also regarded as an important data source for GC diagnosis. Following [1], in this work we evaluated our proposed method on a classical classification task, TCGA-NSCLC classification, to exploit the complementary relationship of genomic data and pathological images. More analysis of the generalizability to other diseases will be explored in the future. [1] Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification, NIPS 2023.

3) Computation analysis: @R3 notes that the training time should be reported, since it is an important factor when deploying such a pipeline in clinical practice. The overall computation analysis (compared to the pre-trained methods) is summarized as follows. Note that the maximum number of pre-training epochs for all methods is set to 100 (about 0.6 hours, limited by hard drive loading speed), and our GiMP shows higher training efficiency.

Model          GFLOPs
GiMP (ours)    12.7517
BioViL         21.6356
REFERS         25.6107
MGCA           55.2245
Minor: @R2: please simplify ‘Then, we design a masked patch modeling paradigm that masks random patch embeddings from a fixed-length contiguous subsequence of a WSI to capture the latent pathological characteristics of different tissues’.

We design a masked patch modeling paradigm (MPM) to capture the latent pathological characteristics of different tissues. The mask strategy is randomly masking a fixed-length contiguous subsequence of patch embeddings of a WSI.
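As a rough illustration of this mask strategy, the sketch below picks one random start position and masks a contiguous run of fixed length in a sequence of patch embeddings. Replacing masked embeddings with a `mask_token` vector is an assumption made for the example, and the function names are hypothetical rather than taken from the GiMP repository.

```python
import random


def contiguous_mask(num_patches, mask_len, rng=None):
    """Return a 0/1 list with a single contiguous run of `mask_len` ones."""
    if not 0 <= mask_len <= num_patches:
        raise ValueError("mask_len must be between 0 and num_patches")
    rng = rng or random.Random()
    # randint is inclusive on both ends, so the run always fits in the sequence.
    start = rng.randint(0, num_patches - mask_len)
    return [1 if start <= i < start + mask_len else 0 for i in range(num_patches)]


def apply_mask(embeddings, mask, mask_token):
    """Replace masked patch embeddings with a copy of `mask_token` (assumed)."""
    return [list(mask_token) if m else list(e) for e, m in zip(embeddings, mask)]


# Toy WSI with 10 one-dimensional patch embeddings and a run of 4 masked patches.
patches = [[float(i)] for i in range(10)]
mask = contiguous_mask(len(patches), 4, random.Random(0))
masked_patches = apply_mask(patches, mask, [-1.0])
```

Masking a contiguous run, rather than independently sampled positions, removes a whole neighbourhood of the patch sequence at once.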

@R2: why and how you generate triplets from a single WSI/omics pair. Can you describe the approach in simple terms?

We generate triplets inside a mini-batch. As in the contrastive learning methods mentioned previously, each element in the mini-batch is treated as the anchor, and we then select a positive sample with the same label as the anchor and a negative sample with the opposite label to build the triplet set.
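The in-batch triplet construction can be sketched as follows. This is a minimal illustration under stated assumptions: the positive is sampled uniformly from the other same-label elements, anchors without a valid positive or negative are skipped, and the loss shown is the standard margin-based triplet loss on Euclidean distance, which may differ from the exact loss used in GiMP. The names `build_triplets` and `triplet_margin_loss` are hypothetical.

```python
import math
import random


def build_triplets(labels, rng=None):
    """Build (anchor, positive, negative) index triplets inside one mini-batch."""
    rng = rng or random.Random()
    triplets = []
    for anchor, label in enumerate(labels):
        # Positives share the anchor's label; the anchor itself is excluded.
        positives = [i for i, l in enumerate(labels) if l == label and i != anchor]
        # Negatives carry the other label.
        negatives = [i for i, l in enumerate(labels) if l != label]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor in this batch
        triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pulls the positive closer than the negative by `margin`."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy batch with two LUAD (0) and two LUSC (1) patients.
batch_labels = [0, 0, 1, 1]
batch_triplets = build_triplets(batch_labels, random.Random(0))
```

With a binary label such as LUAD/LUSC, every anchor has a valid triplet as long as the mini-batch contains at least two samples of its class and one sample of the other class.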

@R3: If could also include the comparison with conventional simple fusion approaches, such as early- or late-fusion, which could show more values of the proposed architecture.

Earlier works on multimodal fusion focus on early fusion and late fusion. Among our comparison methods, PORPOISE and Pathomic Fusion are late-fusion-based methods, and MCAT is an early-fusion-based method.
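For readers unfamiliar with the two terms, the toy sketch below contrasts them in their conventional textbook sense. It is not a description of how PORPOISE, Pathomic Fusion, or MCAT are implemented; the linear scorer, the weighted average, and all names are assumptions chosen for illustration.

```python
def early_fusion_score(image_feat, gene_feat, weights, bias=0.0):
    """Early fusion: concatenate modality features, then apply one joint model."""
    fused = list(image_feat) + list(gene_feat)
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    return sum(w * x for w, x in zip(weights, fused)) + bias


def late_fusion_score(image_score, gene_score, image_weight=0.5):
    """Late fusion: combine scores produced separately by per-modality models."""
    return image_weight * image_score + (1.0 - image_weight) * gene_score


joint = early_fusion_score([1.0, 2.0], [3.0], [0.5, 0.5, 1.0])
combined = late_fusion_score(0.8, 0.4)
```

In the early variant a single model sees both modalities together, while in the late variant each modality is processed independently and only the outputs are merged.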

@R2: please motivate your ablation study.

We evaluate the effectiveness of each component of the GiMP framework in Sec 3.3, and demonstrate the motivation of each module.

@R2: please provide more information in the Fig.1 caption, e.g., about the abbreviations used.

GiMP is short for Gene-induced Multimodal Pre-training, CLS is short for Class Token (indicated in the legend).


