
Authors

Hanci Zheng, Zongying Lin, Qizheng Zhou, Xingchen Peng, Jianghong Xiao, Chen Zu, Zhengyang Jiao, Yan Wang

Abstract

Nasopharyngeal carcinoma (NPC) is a malignant tumor that often occurs in Southeast Asia and southern China. Since there is a need for more precise personalized therapy plans that depend on accurate prognosis prediction, it may be helpful to predict patients’ overall survival (OS) based on clinical data. However, most current deep learning (DL) based methods use a single modality and fail to effectively utilize the large amount of multimodal patient data, causing inaccurate survival prediction. In view of this, we propose a Multimodal Transformer for Survival Prediction (Multi-TransSP) of NPC patients that uses tabular data and computed tomography (CT) images jointly. Taking advantage of both convolutional neural networks and the Transformer, the architecture of our network comprises a multimodal CNN-Based Encoder and a Transformer-Based Encoder. In particular, the CNN-Based Encoder can learn rich information from specific modalities and the Transformer-Based Encoder is able to fuse multimodal features. Our model automatically gives the final prediction of OS with a concordance index (CI) of 0.6941 on our in-house dataset, and it significantly outperforms other methods using any single source of data as well as previous multimodal frameworks. Code is available at https://github.com/gluglurice/Multi-TransSP.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_23

SharedIt: https://rdcu.be/cVRU3

Link to the code repository

https://github.com/gluglurice/Multi-TransSP

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper
    • Uses a multimodal network to combine CT image data and text data to predict patients’ overall survival.
    • Demonstrates the effectiveness of the Transformer model on this task.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • An architecture that incorporates both CT images and text data to predict patients’ overall survival
    • New state-of-the-art performance on the task.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper needs more discussion on the results (e.g. good/bad examples by case studies, interpretability of model’s decision, how the model can be further improved) to help interpret the impact of the work and interpret what 0.02498 MSE and 0.6941 CI actually mean in terms of quality.
    • The novelty of the work is limited to applying Transformer and multimodality on a new task, which is ok if more clinical insights can be provided to interpret the model performance as mentioned above.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper uses an in-house dataset, which raises concerns about reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Given the limited number of patients (only 384), the authors may want to consider cross-validation.
    • Is overall survival defined by duration in terms of weeks, days, hours or minutes?
    • The “text data” (age, BMI, dose) is either numerical or categorical, rather than free-text. Strictly speaking, it should be called structured data rather than text data.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The effectiveness of the proposed model is clear though more clinical insights will be helpful.

  • Number of papers in your stack

    2

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors apply a transformer and a CNN to predict the overall survival of nasopharyngeal carcinoma patients from CT images. The proposed method is compared to competing techniques on a private dataset. The benchmarked methods also include non-imaging methods based on clinical reports (text).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Broad benchmarking even including text
    • Well-motivated task
    • Related work is clearly discussed
    • Figures are well-prepared
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Demographics and details about the non-imaging features used should be added
    • The writing would profit greatly from a thorough grammar check.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset demographics and non-imaging features need clarification; otherwise the work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • If you did not observe death for a subset of patients and treated the time to the last follow-up as the OS time: did you check for patients whose treatment was successful and who therefore no longer showed up?
    • For the error you state (e.g., MSE), please indicate the unit (years, months, …)
    • Please indicate how the segmentations were obtained (e.g., manual, manually by an expert, (semi-)automated method, …)
    • Space embedding: I do not understand the self-learning part of the spatial matrix, since I assume the location information is a value you can simply retrieve from the slice position?
    • Please add more details about:
      • Demographic information of your data
      • The text features you used, since they contain relevant information for the target task
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experimental setup is sound, and the ablation study clearly shows the contributions of the individual elements. I am missing key information about the data and the other features used, which is needed for this research to be reproducible.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper tackles the task of predicting survival in nasopharyngeal carcinoma patients by effectively utilizing the information from the CT images and the clinical text data. To do so, the authors propose a novel multi-modal architecture which leverages the feature extraction power of convolutional neural networks and the feature fusion ability of transformers. The authors demonstrate the efficacy of their proposed model on a small in-house dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper tackles an important problem of fusion between text and imaging data, which is commonly overlooked. Most works focus on fusing the information between different imaging modalities, or imaging and genomics modalities.
    • The paper employs the transformer model to perform the fusion. Given that the transformer model has been shown to work very well in other contexts, it is natural to try this model for key tasks with medical data.
    • The paper provides an elaborate ablation study to evaluate the model.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The details of the use of the transformer model are inadequate. For instance, from the description of the input data in section 2.2, it is unclear what the exact “sequence” is for the input to the transformer. Is it the different feature channels for the same voxel, or the different z-slices of the image? How are the text embeddings included in the input? Where does the expanded text feature join the imaging features? What is N in the “N sequences”?
    • What is a space embedding? Why is it needed? How is it constructed? How does it capture the space? Is this a trainable embedding, or a pre-defined embedding?
    • Why do the authors make use of mean-squared error loss for predicting survival instead of the more commonly used Cox-based losses such as those used in DeepSurv? The MSE loss does not take into account the censoring. Is the y_i used in the MSE loss corresponding to the time of death/last-observed?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experimental details are sufficiently provided in the text, and the authors indicate that they will make the corresponding code available. However, they will not make the in-house dataset used in this study publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper is overall well-structured.

    • Details of the method can be improved as mentioned above. In particular, since one of the key novelties lies in the use of the transformer, this part of the model needs to be explained in good detail. A discussion of why the particular input to the transformer model was chosen is also essential. Why was the particular 2D expansion of text chosen? Has this been shown to be effective in earlier works?
    • The paper would greatly benefit from performance results on more datasets (e.g. other types of cancer) or on different splits of the same data. It is really hard to compare and contrast different methods based entirely on a single run on a single split of the dataset.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper presents an interesting approach to solve an important problem in multi-modality fusion of imaging and text data. Further evaluation on more datasets (e.g. of other cancers, or on different splits of the same data) could greatly help in understanding the true advantage of the proposed method.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper uses a transformer to integrate clinical and imaging features for cancer survival prediction. Results show the effectiveness of the proposed framework. The reviewers provide constructive comments, for example on the details of the non-imaging features and the use of the transformer model, and the authors are encouraged to take them into account. The space embedding in particular is unclear. In addition, some important baseline approaches are missing, e.g., radiomics + clinical features in a well-tuned machine learning framework.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank all the reviewers (R1, R2, R3, Meta-Reviewer) for acknowledging our methodological contribution and for their constructive comments, which we clarify below.

Q1: Details about the space embedding. (R2&R3&Meta-Reviewer) A1: Because the set of valid slices differs between patients (the range covered by the manually delineated segmentation maps varies from patient to patient), it may be difficult to align the slices of all patients. Similar to the position embedding in the Transformer, we add a space embedding to each slice of a 3D CT image so that the inter-slice structural information is kept. This embedding is randomly initialized and learned during training, so that the model does not implicitly align slices of different patients, as it might if we simply used a relative serial number.
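
As a rough illustration of A1, the forward step of an additive, randomly initialized per-slice embedding might look like the sketch below. This is not taken from the authors' repository: the function names, the Gaussian initialization, and indexing the table by slice order are all assumptions made for illustration. In a real model the table would be a trainable parameter updated by backpropagation.

```python
import random

def init_space_embedding(max_slices, feature_dim, seed=0):
    """Randomly initialised table with one vector per slice index.

    Hypothetical helper: in a real network these values would be
    trainable parameters rather than a fixed list.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(feature_dim)]
            for _ in range(max_slices)]

def add_space_embedding(slice_features, table):
    """Add the embedding row for slice index k to the k-th slice feature vector."""
    if len(slice_features) > len(table):
        raise ValueError("more slices than embedding rows")
    return [
        [f + e for f, e in zip(feat, table[k])]
        for k, feat in enumerate(slice_features)
    ]

# Toy example: two slices with identical 3-dimensional features.
table = init_space_embedding(max_slices=4, feature_dim=3)
features = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
embedded = add_space_embedding(features, table)
```

After the addition, the two otherwise identical slice features differ, which is what lets a downstream encoder distinguish slices by position.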

Q2: Details of the use of Transformer and 2D expansion of non-imaging data. (R3&Meta-Reviewer) A2: Each input sequence to the Transformer consists of the different feature channels of the same voxel, and thus we have L sequences (the N in “N sequences” should be changed to L). We will carefully check the symbols and grammar throughout the paper to avoid such typos. For the 2D expansion of non-imaging data, we follow Guan et al. [11], who apply a 3D transformation to tabular data. The non-imaging data are concatenated to the image feature right after the space embedding is added to the image feature. Since the dimensions of the imaging feature and the non-imaging data are different, we need to align them either by expanding the tabular data or by reducing the dimension of the imaging feature (as in the compared method of Yap et al. [22]).

Q3: Demographics and details about non-imaging features. (R1&R2&Meta-Reviewer) A3: A table will be added to describe the demographics for the image and non-image data of patients. More details about the non-imaging features will be added in Section 3.1 (dataset). Moreover, we will replace the term “text data” with the more appropriate “structured data” or “tabular data” in the final paper, as suggested.

Q4: Cross-validation & more datasets. (R1&R3) A4: We agree that cross-validation would give a more reliable estimate of our method’s performance on such a small dataset. We will consider this in future work. In addition, public datasets containing survival data for other cancer types are being considered for use in future work, which will be clarified in the final paper.

Q5: The unit of overall survival and MSE. (R1&R2) A5: Overall survival is measured in months. Note that we normalize all non-imaging data and survival labels to [0, 1] before passing them to the framework, so the MSE loss is expressed in normalized units rather than months.
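
To make the normalized MSE easier to read, the sketch below shows how an MSE on [0, 1] labels could be mapped back to months. It assumes min-max scaling, which the rebuttal does not state explicitly. The label values and helper names are hypothetical, since the actual OS range of the dataset is not given here.

```python
import math

def min_max_normalize(values):
    """Scale values linearly to [0, 1] (assumed scheme, not confirmed by the authors)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot min-max scale a constant list")
    return [(v - lo) / (hi - lo) for v in values], lo, hi

def mse(y_true, y_pred):
    """Mean squared error between two equally long lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse_in_months(normalized_mse, lo, hi):
    """Convert an MSE on min-max scaled labels back to an RMSE in months."""
    return math.sqrt(normalized_mse) * (hi - lo)

# Hypothetical OS labels in months, for illustration only.
os_months = [6.0, 18.0, 30.0, 54.0, 66.0]
scaled, lo, hi = min_max_normalize(os_months)

reported_mse = 0.02498                    # value quoted by Review #1
rmse_fraction = math.sqrt(reported_mse)   # about 0.158 of the label range
example_rmse = rmse_in_months(reported_mse, lo, hi)  # under the hypothetical 60-month range
```

Under this assumption, an MSE of 0.02498 corresponds to a root-mean-square error of roughly 16% of the OS range in the data. With the hypothetical 60-month range above, that would be about 9.5 months.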

Q6: More discussion on the results & more clinical insights & future improvements. (R1) A6: For experiment results, we will consider adding more analysis to explain the effectiveness of each module. Also, more clinical background information will be explained in the “introduction” section. In the future, we consider utilizing a more effective way to extract image features as well as to fuse multimodal features, which will also be clarified in the final paper.
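
As background for R1's question about what a CI of 0.6941 means, the sketch below computes Harrell's concordance index for predicted survival times. It is a generic textbook formulation and not the authors' evaluation code. The function name and toy data are hypothetical, and it assumes the model outputs a predicted time where longer means better prognosis.

```python
def concordance_index(times, events, predicted_times):
    """Harrell's C-index for predicted survival times.

    times: observed time (death or last follow-up)
    events: 1 if death was observed, 0 if the patient was censored
    predicted_times: model output; a longer time should mean longer survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if patient i died strictly before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0      # correct ordering
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5      # tie in prediction counts as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: four patients, the second one censored.
times = [10, 20, 30, 40]
events = [1, 0, 1, 1]
predicted = [12, 25, 40, 35]
ci = concordance_index(times, events, predicted)  # 3 of 4 comparable pairs ordered correctly
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so a CI of 0.6941 means that roughly 69% of comparable patient pairs are ranked in the correct order.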

Q7: Definition of OS time. (R2) A7: Overall Survival (OS), defined as the time from randomization to death (from any cause), is a direct measure of clinical benefit to a patient. We collected the follow-up data by telephone, which is somewhat private, so this may be considered a limitation of the study and will be stated as such in the final paper.

Q8: How were the segmentations obtained? (R2) A8: The segmentation maps were manually delineated by experts. We will clarify this in the final paper.

Q9: Why use MSE loss instead of Cox-based losses, and what is y_i? (R3) A9: We initially used MSE loss for simplicity, and we will consider Cox-based losses in future work. The y_i in the MSE loss is the survival label of one NPC patient.
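
For context on R3's question, a Cox-style objective differs from MSE in how censored patients enter the loss: they contribute only through the risk sets of patients who died earlier. The sketch below is a generic negative log partial likelihood with Breslow-style risk sets, not the authors' code. The function name and toy inputs are hypothetical, and `risk_scores` assumes a model output where a higher value means a higher hazard, unlike the survival-time regression described in the paper.

```python
import math

def cox_neg_log_partial_likelihood(times, events, risk_scores):
    """Negative log partial likelihood, averaged over observed events.

    times: observed time (death or last follow-up)
    events: 1 if death was observed, 0 if censored
    risk_scores: model output, higher means higher hazard (assumption)
    """
    total = 0.0
    n_events = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored patients appear only inside risk sets
        # Risk set: everyone still under observation at time t_i.
        log_risk_set = math.log(sum(
            math.exp(risk_scores[j]) for j in range(n) if times[j] >= times[i]
        ))
        total += risk_scores[i] - log_risk_set
        n_events += 1
    if n_events == 0:
        raise ValueError("no observed events")
    return -total / n_events

# Toy example: three patients, the last one censored.
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
loss_good = cox_neg_log_partial_likelihood(times, events, [2.0, 1.0, 0.0])
loss_bad = cox_neg_log_partial_likelihood(times, events, [0.0, 1.0, 2.0])
```

In the toy example, assigning higher risk to the patients who died earlier gives a lower loss than the reversed assignment, while the censored patient never contributes a term of its own.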


