
Authors

Jing Xu, Yuan Gao, Wei Liu, Kai Huang, Shuang Zhao, Le Lu, Xiaosong Wang, Xian-Sheng Hua, Yu Wang, Xiang Chen

Abstract

Skin tumor is one of the most common diseases worldwide, and the survival rate could be drastically increased if cancerous lesions were identified early. The intrinsic visual ambiguities that skin tumors display in multi-modal imaging data pose substantial challenges to precise diagnosis, especially at the early stage. To achieve high diagnostic accuracy or precision, all available clinical data (imaging and/or non-imaging) from multiple sources are used, and the missing-modality problem must be tackled when some modality is unavailable. To this end, we first devise a new disease-wise pairing of all accessible patient data that fall into the same disease category, as a remix operation on data samples. A novel cross-modality-fusion module is also proposed and integrated with our transformer-based multi-modality deep classification framework, which can effectively perform multi-source data fusion (i.e., clinical images, dermoscopic images, and accompanying patient-wise clinical metadata) for skin tumors. Extensive quantitative experiments are conducted. We achieve an absolute increase of 6.5% in averaged F1 and 2.8% in accuracy for the classification of five common skin tumors, compared to the prior leading method on the Derm7pt dataset of 1011 cases. More importantly, our method obtains an overall classification accuracy of 88.5% on a large-scale in-house dataset of 5601 patients covering ten skin tumor classes (pigmented and non-pigmented). This experiment further validates the robustness of our method and implies its potential clinical usability in a more realistic and pragmatic clinical setting.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_60

SharedIt: https://rdcu.be/cVRuL

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a transformer-based multi-modality classification framework for skin tumors that simulates the diagnostic process of dermatologists in realistic situations. The proposed method achieves SOTA results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As above

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    However, some points are not clearly expressed.

    In Section 2.2, “When DWP is turned on (based on p > Tp, p ∈ [0,1])”, what does p mean here? How is p obtained?

    What are the global features and local features?

    In Fig. 2, what does “patch token” mean? It never appears in the main text.

    Section 2.3 is very confusing. For example, the authors state that gc or gd is generated by a GAP layer and gm by an LN layer. But in the formulas below, it is zx and gx’ that are generated by LN and GAP. It is unclear whether gx’ here represents the same gc, gd, gm mentioned before or something else. Also, I suggest adding the notations lc, ld, and lm (if there is one) to Figure 2.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Same comments as those listed under the main weaknesses above.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Same comments as those listed under the main weaknesses above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Same comments as those listed under the main weaknesses above.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The paper proposes a disease-wise pairing of all accessible patient data. Further, a cross-modal fusion module is proposed and integrated with a transformer-based multi-modality fusion module.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea of multi-modal fusion with a transformer is smart. 2) Experiments show the superiority of the model in terms of F1 score and accuracy.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Although the chosen problem is exciting, technical novelty is still missing. The authors use the state-of-the-art Swin Transformer for the given task.
    2) The proposed architecture seems to have multiple branches for each modality; however, the computational complexity and the number of parameters are not discussed.
    3) In Table 1, the second-best performing network would be Inception-comb. Relative to it, the performance gain is 5.5% in average accuracy, not 12%.
    4) The augmentation might not be clinically sound.
    5) The experiments seem a bit weak.
    6) Please consider adding a significance test to check the significance of the results.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Technically, the method should be reproducible; this is supported by experiments on two datasets for one application. That being said, the model seems computationally expensive.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Same comments as those listed under the main weaknesses above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper shows an example of how to use a transformer-based model for multi-modal data fusion. Experiments are performed on two datasets. But technical novelty is missing, so I weakly accept it.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I am satisfied with the answers given in the rebuttal.



Review #3

  • Please describe the contribution of the paper

    A cross-modality-fusion module integrated with a transformer-based multi-modality deep classification framework that can fuse multi-source data (i.e., clinical images, dermoscopic images, and accompanying patient-wise clinical metadata) for skin tumors. Validation on 1011 cases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • a new multi-modal cross-fusion transformer for multi-modality data fusion
    • Disease-wise Pairing as Augmentation to address the problems of missing modality.
    • a new cross-modality fusion model to use global features
    • solid experiments
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is hard to reproduce the method, as many details are missing from Figure 2. How are patches embedded? What is RS? What is M?
    • Parameter settings of the detailed architecture are not disclosed.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is very difficult to reproduce the paper, as the abbreviations in the flow chart and the detailed parameter settings of the architecture are not disclosed in full detail. It is suggested to release the code, at least the model architecture part.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    To demonstrate the contribution in addressing missing modalities, it is suggested to report the percentage of missing data. In addition, experiments should include validation with different percentages of missing data and modalities.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has clinical application merits and technical innovations. The paper will benefit the research community. The only concern is reproducibility. If the authors could release the code, that would be great.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a transformer-based multi-modality classification framework for skin tumors that simulates the diagnostic process of dermatologists in realistic situations. The proposed method achieves good performance. Generally speaking, the paper is well organized, but some reviewer comments still need to be addressed. Please revise the paper carefully and we will then consider it for acceptance.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank all reviewers for their constructive comments.

R2Q1: The goal of this work is to develop a more clinically suitable computer-aided diagnosis system for skin tumors. Our system works for a broader disease spectrum, including both pigmented and non-pigmented skin tumors. Effectively utilizing multi-source/modality data is essential for precise skin tumor diagnosis. We put most of our effort into designing effective multi-modal fusion modules and the corresponding training methods, which have been largely overlooked by the community so far. We proposed a simple yet effective multi-modal framework into which any transformer-based backbone can be plugged. Swin Transformer was chosen simply for its good general performance.

Design choices:

R2Q4,R3: Disease-wise pairing (DWP) is not a pure implementation trick; it has a sound and logical clinical basis. Dermatologists look for and fuse different semantic disease-level characteristics across the three modalities and can make inferences without aligning them at the precise/rigid patient level. The features/information of the three modalities are complementary, so it is beneficial to shuffle/fuse them in the semantic space. Compared to filling in missing data with zero padding, DWP improves F1 by 2.2% and 0.9% when 50% and 60% of dermoscopic images are missing, respectively. When 50% and 60% of the metadata is missing, DWP improves upon zero padding by 2.0% and 3.8%, respectively.

R2Q6: The effectiveness of the cross-modality fusion (CMF) module is further validated with a t-test over 10 repeated runs of the experiments. The p-values for the average F1 and Top-1 accuracy are 1.8e-4 and 7.2e-6, respectively (p < 0.01), indicating that the improvement of CMF over the “concatenate” operation is statistically significant.
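
The kind of check described in R2Q6 could be sketched as follows. This is only an illustration, assuming a paired t-test over 10 matched runs (the rebuttal does not say which t-test variant was used). The F1 values are synthetic placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for matched runs a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


# Synthetic placeholder F1 scores for 10 runs (NOT values from the paper).
f1_cmf = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.80, 0.79, 0.81, 0.80]
f1_concat = [0.78, 0.78, 0.77, 0.79, 0.78, 0.79, 0.77, 0.77, 0.78, 0.78]

t_stat, dof = paired_t_statistic(f1_cmf, f1_concat)

# Two-sided critical value of Student's t for alpha = 0.01 with 9 degrees
# of freedom; exceeding it corresponds to p < 0.01.
T_CRIT_DOF9_ALPHA_001 = 3.250
significant = abs(t_stat) > T_CRIT_DOF9_ALPHA_001
```

Comparing against a tabulated critical value keeps the sketch dependency-free; a statistics package would normally be used to obtain the exact p-value.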

R2Q2: For computational cost, we design a shared feature extractor for both clinical and dermoscopic images. As described in the Results, our model has far fewer parameters than SOTA methods such as FusionM4Net (32.3M vs. 54.4M). Our method is end-to-end. Its inference time (28 ms) is much shorter than that of FusionM4Net, which takes dozens of seconds to search for optimal fusions.

Implementation details: Our backbone is a regular Swin Transformer; parameters such as patch size, window size, number of layers, and output feature dimension are the same as in Swin-T/224 or Swin-B/384. One CMF layer is appended after the backbone in our experiments. In the CMF module, the dimensions of the input local and global features equal the output feature dimension of the backbone (768 for Swin-T and 1024 for Swin-B), and the number of heads in Multi-Head Self-Attention is 8.
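
The settings listed above can be collected in a small configuration sketch. `BackboneConfig` and its field names are hypothetical and do not come from the authors' code. The per-head dimension is derived on the assumption that the feature dimension is split evenly across attention heads, as in standard multi-head attention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackboneConfig:
    name: str
    image_size: int      # input resolution of the backbone variant
    feature_dim: int     # output feature dimension (local and global features)
    cmf_layers: int = 1  # one CMF layer appended after the backbone
    cmf_heads: int = 8   # heads in Multi-Head Self-Attention

    @property
    def head_dim(self) -> int:
        """Per-head dimension, assuming an even split across heads."""
        if self.feature_dim % self.cmf_heads != 0:
            raise ValueError("feature_dim must be divisible by cmf_heads")
        return self.feature_dim // self.cmf_heads


SWIN_T = BackboneConfig(name="Swin-T/224", image_size=224, feature_dim=768)
SWIN_B = BackboneConfig(name="Swin-B/384", image_size=384, feature_dim=1024)
```

Under this assumption, each of the 8 heads would work on 96 dimensions for Swin-T and 128 for Swin-B.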

R1,R3: In Swin Transformer, patch tokens are generated from the input images by a patch-splitting module followed by a linear embedding layer. The local feature is derived from the feature map of the last stage, and the global feature is the output of a global average pooling (GAP) layer. In Fig. 2, random sampling (RS) means that the input sample is formed either by DWP or by zero padding; C, D, and M are the three modalities described in Sec. 2.2. We will add more details to Fig. 2 for better illustration.
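
To make the token counts concrete, the sketch below computes how many patch tokens enter the backbone and how many local tokens remain in the last stage. It assumes the standard Swin settings (patch size 4 and three patch-merging steps that each halve the resolution), which the rebuttal implies but does not spell out. `swin_token_counts` is a hypothetical helper.

```python
def swin_token_counts(image_size, patch_size=4, num_merges=3):
    """Return (patch tokens after embedding, local tokens in the last stage)."""
    total_stride = patch_size * 2 ** num_merges
    if image_size % total_stride != 0:
        raise ValueError("image_size must be divisible by the total stride")
    first_side = image_size // patch_size        # token grid after patch splitting
    last_side = first_side // (2 ** num_merges)  # token grid of the last stage
    return first_side * first_side, last_side * last_side


tokens_224 = swin_token_counts(224)  # Swin-T/224
tokens_384 = swin_token_counts(384)  # Swin-B/384
```

With these assumptions, a 224×224 input gives a 56×56 grid of patch tokens and a 7×7 grid of local tokens; the global feature is then the average of those last-stage tokens.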

R1: T_p is a cutoff threshold that controls the probability of applying DWP to the input samples, and p is a random number drawn from a uniform distribution on the interval [0, 1]. For each input sample, we apply DWP when p > T_p (T_p is empirically set to 0.6).
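
The sampling rule above can be sketched as follows. Only the rule "apply DWP when p > T_p, with T_p = 0.6" comes from the rebuttal. The sample layout, the helper name `disease_wise_pair`, and the choice to draw each modality independently from any same-class sample are assumptions made for illustration.

```python
import random

T_P = 0.6  # threshold reported in the rebuttal


def disease_wise_pair(sample, pool_by_class, rng, t_p=T_P):
    """Return the sample unchanged, or a disease-wise remix of it.

    sample: dict with keys 'label', 'clinical', 'dermoscopic', 'meta';
            a missing modality is stored as None.
    pool_by_class: dict mapping each label to the list of samples of that class.
    """
    p = rng.random()  # uniform draw on [0, 1)
    if p <= t_p:
        # DWP is off: keep the original sample; missing modalities would be
        # handled elsewhere (e.g. by zero padding).
        return dict(sample)
    paired = {"label": sample["label"]}
    candidates = pool_by_class[sample["label"]]
    for modality in ("clinical", "dermoscopic", "meta"):
        # Assumption: each modality is drawn from any same-class sample that has it.
        available = [c[modality] for c in candidates if c[modality] is not None]
        paired[modality] = rng.choice(available) if available else sample[modality]
    return paired
```

With T_p = 0.6, roughly 40% of the input samples would be remixed and the remaining 60% passed through as they are.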

R1: In Sec. 2.3, l_c and l_d are local features from the Swin Transformer (there is no l_m). For g_x’, GAP is applied only to l_c and l_d to generate g_c (x’=c) or g_d (x’=d); g_m (x’=m) is generated by a linear layer. z_x is a temporary feature obtained by applying layer normalization (LN) to the local and global features.

The percentages of missing data are 0.4%, 33.6%, and 37.7% for clinical images, dermoscopic images, and metadata, respectively (see details in Data). The average F1 improvement on Derm7pt is 6.5%.
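
The feature definitions clarified for Sec. 2.3 can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the shapes, the metadata weights `W_m` and `b_m`, and the use of LN on a single global feature are all hypothetical, since the rebuttal does not specify how local and global features are combined before normalization.

```python
import math


def global_average_pool(tokens):
    """Average a list of token vectors (N x D) into one D-dimensional vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def linear(x, weight, bias):
    """Compute y = W x + b, with weight given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Toy inputs: 2 local tokens of dimension 3 per image modality, 2 metadata fields.
l_c = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]    # local features, clinical image
l_d = [[0.0, 1.0, 0.0], [2.0, 1.0, 4.0]]    # local features, dermoscopic image
meta = [1.0, 0.0]                           # encoded metadata (hypothetical)
W_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical linear-layer weights
b_m = [0.0, 0.0, 0.0]

g_c = global_average_pool(l_c)  # x' = c: GAP over l_c
g_d = global_average_pool(l_d)  # x' = d: GAP over l_d
g_m = linear(meta, W_m, b_m)    # x' = m: linear layer, no local feature l_m
z_c = layer_norm(g_c)           # LN applied to a global feature
```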




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the feedback of the authors and the combined comments of the reviewers, we have decided to accept this paper. The authors have done a great job rebutting concerns, including clarifying the quantitative data that support their findings and showing that cross-modal fusion and disease-wise pairing are innovative. It is recommended that this be better emphasized in the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I think this is an overall interesting paper. The disease-wise pairing of training samples seems reasonable. The use of the dot-product attention for cross-modality fusion, as pointed out by R2, is an interesting idea. The authors have also clarified the reviewers’ concerns in the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have done a good job of rebutting concerns, including clarifying the quantitative data to back up their findings. While the transformer itself is not novel in conception, the cross-modality fusion and disease-wise pairing are innovative. It is suggested to highlight this better in the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4


