
Authors

Weihang Dai, Xiaomeng Li, Taihui Yu, Di Zhao, Jun Shen, Kwang-Ting Cheng

Abstract

Atrial Fibrillation (AF) is characterized by rapid, irregular heartbeats, and can lead to fatal complications such as heart failure. The disease is divided into two sub-types based on severity, which can be automatically classified through CT volumes for disease screening of severe cases.
However, existing classification approaches rely on generic radiomic features that may not be optimal for the task, whilst deep learning methods tend to over-fit to the high-dimensional volume inputs.
In this work, we propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification. Unlike existing hybrid techniques that mostly rely on naïve feature concatenation, we observe that radiomic feature selection methods can serve as an information prior, and propose supplementing low-level deep neural network (DNN) features with locally computed radiomic features. This reduces DNN over-fitting and allows local variations between radiomic features to be better captured. Furthermore, we ensure complementary information is learned by deep and radiomic features by designing a novel feature de-correlation loss. Combined, our method addresses the limitations of deep learning and radiomic approaches and outperforms state-of-the-art alternatives, achieving 86.9% AUC on the AF sub-type classification task.
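The exact form of the feature de-correlation loss is defined in the paper itself; as a rough illustration of the general idea, one common formulation penalizes the squared cross-correlation between the two feature sets. The sketch below is an assumption in that style (the function name and normalization choices are illustrative, not taken from the paper):

```python
import numpy as np

def decorrelation_loss(deep_feats, radiomic_feats, eps=1e-8):
    """Mean squared entry of the cross-correlation matrix between two
    (n_samples, n_features) feature batches; 0 when the sets are uncorrelated."""
    d = deep_feats - deep_feats.mean(axis=0)
    r = radiomic_feats - radiomic_feats.mean(axis=0)
    d = d / (np.linalg.norm(d, axis=0) + eps)      # unit-norm columns
    r = r / (np.linalg.norm(r, axis=0) + eps)
    corr = d.T @ r                                 # (n_deep, n_radiomic) correlations
    return float((corr ** 2).mean())
```

Minimizing such a term pushes each deep feature to be linearly uncorrelated with every radiomic feature, which is one way to encourage the complementarity the abstract describes.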

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_15

SharedIt: https://rdcu.be/dnwLp

Link to the code repository

https://github.com/xmed-lab/RIDL

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    In this study, the authors present a novel framework for predicting atrial fibrillation (persistent vs. paroxysmal) using CT volumes centered at the left atrium and the corresponding region of interest - which is EAT in this case. The classification framework brings technical novelty by combining radiomics features at multiple resolutions of the deep convolutional classifier. It is interesting to note that the authors use de-correlation loss between the neural network features and the radiomic features, which encourages redundancy elimination. Overall, the study is elaborate with good technical contribution and it tackles a pertinent clinical problem as well.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of this work are:

    1. The classification framework is a novel ensembling strategy for using radiomics features as priors for deep learning-based classification
    2. The authors concatenate multi-resolution radiomics features to capture local changes for disease characterization
    3. Channel attention is appropriately used to learn the relative importance of deep and radiomics features
    4. The de-correlation module encourages identification of complementary features
    5. The authors conduct an ablation study for further insights
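For context on point 3, channel attention of this kind usually follows the squeeze-and-excitation pattern: pool each channel to a scalar, pass the result through a small bottleneck, and gate the channels with sigmoid weights. Below is a minimal sketch for 3D feature volumes, with assumed shapes and weights (not the paper's implementation):

```python
import numpy as np

def channel_attention(feats, w1, w2):
    """SE-style gating for (B, C, D, H, W) feature volumes.
    w1: (C, C//r) bottleneck weights, w2: (C//r, C)."""
    z = feats.mean(axis=(2, 3, 4))            # squeeze: global average pool -> (B, C)
    h = np.maximum(z @ w1, 0.0)               # excite: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(h @ w2)))       # per-channel gates in (0, 1)
    return feats * s[:, :, None, None, None]  # re-weight channels
```

Because the gates are learned from the pooled statistics of all channels, the network can weigh deep channels against radiomic channels adaptively, which is the relative-importance role the reviewer highlights.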
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some weaknesses that may be addressed are:

    1. Did authors conduct any statistical tests between the AUCs of reported methods?
    2. Is it important to have the EAT mask/ROI as a second input? What if the model doesn’t need it?
    3. Some cohort characteristics, such as age, sex, and race, may also be reported
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Fair

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The authors should conduct statistical tests for comparing the AUCs - we suggest DeLong’s test
    2. The authors should also experiment without using the EAT ROI as an input for both the radiomics and the deep learning-based model. It is possible that the authors might discover some novel biomarkers/signatures outside of the EAT that are highly predictive.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The classification framework provides a good technical contribution for combining radiomics and deep learning. The experiments are exhaustive, and the framework design has been developed with minor details in consideration. The ablation study also adds to the clarity of the results.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concerns. Overall, the paper provides a unique and effective way of combining deep and radiomics features. The method could be clinically useful.



Review #3

  • Please describe the contribution of the paper

    The authors use a novel approach for fusing radiomics and deep learning features. They combine local radiomic features, computed for each cubic patch, with early-layer deep learning features to obtain better local context. They also ensure that only de-correlated features are fused, so that complementary information is obtained from deep and radiomic features. They claim this fusion approach achieves state-of-the-art performance for screening patients at high risk of persistent atrial fibrillation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approach to hybrid modeling used by the authors is novel. Feature fusion at low level combined with de-correlation loss achieves the top results. The authors also show extensive ablation experiments to show the performance improvement achieved by the model is the effect of the novel contributions of their methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper is that the authors have reported performance using 5-fold cross-validation. They also mention that feature selection is done using cross-validation. As discussed in “Inconsistent Partitioning and Unproductive Feature Associations Yield Idealized Radiomic Models” by Gidwani et al., cross-validation-based feature selection and performance reporting leads to leakage from the training set to the test set, and thus to significant performance inflation. Dividing the data into Train, Validation, and Test (holdout) sets is necessary, with feature selection done on the Train set, hyperparameter tuning on the Train + Validation sets, and performance reported on the Test (holdout) set.

    The performance gain in AUC is 1.1%. Statistical significance testing using DeLong’s test on the prediction probabilities of the baseline classifier vs. the RIDL classifier must be performed.
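    For reference, the requested comparison can be carried out with the fast DeLong procedure for two correlated AUCs. The sketch below is a generic illustration of that test (all names are illustrative; it is not code from the paper under review):

```python
import math
import numpy as np

def _midrank(x):
    """Midranks (ties receive the average of their ranks)."""
    order = np.argsort(x)
    xs = x[order]
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j < len(x) and xs[j] == xs[i]:
            j += 1
        ranks[order[i:j]] = 0.5 * (i + j - 1) + 1.0
        i = j
    return ranks

def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for the difference between two correlated AUCs.
    Returns (auc_a, auc_b, z, p), using the fast O(n log n) formulation."""
    y_true = np.asarray(y_true)
    scores = np.vstack([scores_a, scores_b])
    pos = scores[:, y_true == 1]
    neg = scores[:, y_true == 0]
    m, n = pos.shape[1], neg.shape[1]
    k = scores.shape[0]
    tx = np.array([_midrank(pos[i]) for i in range(k)])
    ty = np.array([_midrank(neg[i]) for i in range(k)])
    tz = np.array([_midrank(np.concatenate([pos[i], neg[i]])) for i in range(k)])
    aucs = tz[:, :m].sum(axis=1) / (m * n) - (m + 1.0) / (2.0 * n)
    v01 = (tz[:, :m] - tx) / n             # structural components, positives
    v10 = 1.0 - (tz[:, m:] - ty) / m       # structural components, negatives
    s = np.cov(v01) / m + np.cov(v10) / n  # covariance of the two AUC estimates
    z = (aucs[0] - aucs[1]) / math.sqrt(s[0, 0] + s[1, 1] - 2.0 * s[0, 1])
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return aucs[0], aucs[1], z, p
```

    Because the test accounts for the correlation between the two classifiers' predictions on the same cases, it is more appropriate here than comparing independent AUC confidence intervals.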

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset is not available; code will be available if the paper is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Good work, with a novel feature combination strategy. However, the feature fusion approach needs rigorous validation with a holdout set or an external dataset, as the performance improvement over the baseline is ~1%.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The performance improvement could be due to inflation associated with cross-validation-based performance reporting. Also, statistical significance over the baseline needs to be conveyed.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The study aims to improve the classification of two sub-types of Atrial Fibrillation (AF) through a novel method that combines deep learning and radiomic approaches. The authors propose the use of locally computed radiomic features to supplement low-level deep neural network features and address the issue of over-fitting. The authors also design a feature de-correlation loss to ensure complementary information is learned by deep and radiomic features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a new way of combining radiomic features with low-level DNN features and encourages complementary deep and radiomic features to be learned.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are some limitations in terms of technical novelty:

    1. It is unclear which specific method is used for feature selection, and there is limited discussion of this aspect.
    2. The calculation time for these features is important, as efficiency is a major concern, and further clarification is needed on this point.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be available upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. What is the patch size used for the patch-wise feature, and how are these features combined? Is feature selection performed at the patch-wise level?
    2. Since the size of the patch-wise features should be large, would it affect the network training?
    3. Can you provide references for the statement, “We note that only texture radiomic features are used for local calculation since they are specifically intended to capture local context.”?
    4. What is the reason for using global and local features separately? Have experiments been conducted using different feature sets as inputs of low or high-level feature concatenation?
    5. Why are parameters not used for the reconstruction loss in Equation 6?
    6. How was overfitting addressed, given the size of the dataset?
    7. Is the dataset private? Are there any publicly available datasets that can be used?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    technological innovation

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification.

    • Novel ensembling strategy for using radiomics features as priors for deep learning-based classification with feature fusion and decorrelation loss
    • Network design and module are well detailed
    • Reasonable comparison to SOTA
    • Ablation study presented
    • ~1% performance improvement reported; this should be contextualized
    • No statistical analysis provided; cross-validation only on a limited dataset




Author Feedback

MR1: Thanks for the positive comments. Validation and performance issues are addressed in R3 6.1-6.2.

R2:
6.1) Statistical significance: The DeLong AUC test statistic of RIDL vs. the baseline is 1.46. Improvements are statistically significant at p<0.10.
6.2) EAT ROI: ROI masks are compulsory for calculating radiomic features. Naïve and hybrid baseline DNNs trained without the EAT ROI also have poor results, with AUCs of 71% and 82% respectively. In this task, the EAT ROI provides additional context for learning better features.
6.3) Cohort statistics: Age and gender distributions are shown in S-Fig. 2. We will include race data.

R3:
6.1) Cross-validation: To clarify our cross-validation method, we split the dataset into 5 chunks and use 3 chunks for training, 1 for validation, and 1 for testing in each fold. We use a rolling strategy so that different chunks are used for the test and validation splits in each fold. Only the training and validation splits were used for feature and hyper-parameter selection in each fold. In effect, we performed strict train-validation-test split evaluation with 5 different splits. This evaluation is even more rigorous since it accounts for variation from different splits. For example, when we used a single train-validation-test split for RIDL, we obtained 89.9% AUC vs. 87.5% AUC for the baseline, a better result, but this is misleading as it ignores the effect of different splits. To further demonstrate that there is no test data leakage, our method has only one parameter, w_corr, and S-Fig. 4b shows that results outperform the baseline for all w_corr values tested, with stable performance in [1.5, 3].
6.2) Statistical significance: The DeLong AUC test statistic of RIDL vs. the baseline is 1.46. This demonstrates that our method has valid improvements (p<0.10), which may be used and expanded upon by other works. Improved classification also leads to better diagnosis and treatment planning for patients.
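The rolling split the rebuttal describes (5 chunks; 3 for training, 1 for validation, 1 for testing, rotating each fold) can be sketched generically as follows. This is an illustrative reconstruction of the scheme as described, with assumed names, not the authors' code:

```python
import numpy as np

def rolling_splits(n_samples, n_folds=5, seed=0):
    """Yield (train, val, test) index arrays; each chunk serves as the
    test split exactly once, with the validation chunk rolling alongside it."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        test = chunks[k]
        val = chunks[(k + 1) % n_folds]   # validation chunk rolls with the fold
        train = np.concatenate([chunks[j] for j in range(n_folds)
                                if j not in (k, (k + 1) % n_folds)])
        yield train, val, test
```

Under this scheme the test chunk of each fold never overlaps its training or validation chunks, which is the no-leakage property the rebuttal argues for.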

R4:
6) Limited novelty: Our method is the first to fuse locally computed radiomic features with low-level DNN features, and to encourage complementary features between radiomic and deep learning approaches. Unlike existing hybrid methods, our technique is the first to consider the advantages and disadvantages of both and combine them in a complementary way.
6.1) Feature selection method: Feature selection methods are shown in Table 1 under “Selector” for the different radiomic methods. We use LASSO regularization for feature selection in the hybrid approaches (stated in Section 3.2), because these features are more predictive.
6.2) Calculation time: Local and global radiomic features can be extracted in approximately 1 and 5 seconds per volume respectively. These are computed only once, without the need for training.
9.1) Patch size and feature selection: Features are selected by LASSO from globally extracted features. We use patch sizes {1,2,5,10} (stated in Section 3.1).
9.2) Size of local features: The size of local radiomic features is 4x96x128x128 (stated in Section 3.1). DNN model size increases by only 11%.
9.3) Reference regarding texture features: We refer readers to [25] for feature descriptions.
9.4) Local and global features: The motivation for using global and local features separately is that local context is not well captured by a single value. For example, the surface of a tumor may be smooth on one side and rough on another. This heterogeneity is better captured using local features instead of a global value, thus motivating their separate use. Table 3 shows results for different local features. We do not test other high-level features since global radiomic and deep feature concatenation is a standard method.
9.5) Parameters for L_rec: Reconstruction is used for regularization and is secondary to our method, so we do not use a separate parameter.
9.6) Reducing overfitting: Pre-computed radiomic features, reconstruction, and augmentations reduce overfitting.
9.7) Dataset: We only used a private dataset but will try to make it public.
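The rebuttal states that LASSO regularization is used for feature selection in the hybrid approaches. As background on the mechanism, a minimal coordinate-descent LASSO selector might look like the following; this is a generic sketch with assumed, roughly standardized inputs, not the authors' implementation:

```python
import numpy as np

def lasso_select(X, y, lam, n_sweeps=200):
    """Select features by coordinate-descent LASSO: soft-threshold each
    weight in turn and return the indices with non-zero weights."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            # residual with feature j's current contribution added back in
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return np.flatnonzero(w)
```

The L1 penalty drives the weights of uninformative features exactly to zero, so the surviving indices form the selected feature subset, which is what makes LASSO usable as a selector rather than just a regularized regressor.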




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors propose a novel radiomics-informed deep-learning method, RIDL, that combines the advantages of deep learning and radiomic approaches to improve AF sub-type classification. The major strength is the novel ensembling strategy for using radiomics features as priors for deep learning-based classification, with feature fusion and a decorrelation loss, plus a good comparison and ablation study. The rebuttal attempts to address a lot of comments, with mixed results. Some clarifications are provided for reviewer comments on the experimental setup. The DeLong AUC test has p < 0.1, which would not commonly be considered statistically significant (an unsurprising result given the marginal 1% improvement). The comment on contextualizing results has not been addressed; it is not clear whether this performance is reasonable, an advance in the field, or comparable to previous results. The method clearly has novelty, but I’m not convinced that the current implementation demonstrates sufficient advantages at this stage.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main strength of the paper lies in the proposal of a new way of combining radiomic features with low-level DNN features while encouraging complementary deep and radiomic features to be learned. The rebuttal has addressed well the concern about novelty and the issues with the missing statistical analysis and cross-validation on a limited dataset. Though the evaluation is currently limited, as the authors do not have a large database, they have provided a few future directions for tackling this and improving the paper. Overall, the strength of the work outweighs its limitations, and it is suggested for acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a different strategy for combining radiomics and DNN features for atrial fibrillation subtyping with CT images, leading to a performance gain (although a marginal one, i.e., hard to call statistically significant with a p-value around 0.1) on a very small in-house dataset. In the future, the authors are encouraged to conduct more thorough evaluations on more datasets to comprehensively justify the efficacy and reproducibility of the proposed method.


