Authors

Eike Petersen, Aasa Feragen, Maria Luise da Costa Zemsch, Anders Henriksen, Oskar Eiler Wiese Christensen, Melanie Ganz

Abstract

Convolutional neural networks have enabled significant improvements in medical image-based diagnosis. It is, however, increasingly clear that these models are susceptible to performance degradation when facing spurious correlations and dataset shift, leading, e.g., to underperformance on underrepresented patient groups. In this paper, we compare two classification schemes on the ADNI MRI dataset: a simple logistic regression model using manually selected volumetric features, and a convolutional neural network trained on 3D MRI data. We assess the robustness of the trained models in the face of varying dataset splits, training set sex composition, and stage of disease. In contrast to earlier work in other imaging modalities, we do not observe a clear pattern of improved model performance for the majority group in the training dataset. Instead, while logistic regression is fully robust to dataset composition, we find that CNN performance is generally improved for both male and female subjects when including more female subjects in the training dataset. We hypothesize that this might be due to inherent differences in the pathology of the two sexes. Moreover, in our analysis, the logistic regression model outperforms the 3D CNN, emphasizing the utility of manual feature specification based on prior knowledge, and the need for more robust automatic feature selection.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16431-6_9

SharedIt: https://rdcu.be/cVD4Q

Link to the code repository

https://github.com/e-pet/adni-bias

Link to the dataset(s)

https://adni.loni.usc.edu/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose to evaluate the robustness of an automatic feature extraction method based on a 3D CNN compared to a logistic regression method for Alzheimer’s disease classification. The authors showed that

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written
    • Hypotheses are clearly stated and tested
    • Findings show no sex-dependency
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could provide a more comprehensive analysis of possible pitfalls of deep-learning methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides enough details to enable reproducibility of the experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Why did the authors restrict their analysis to sex dependency only? It would be interesting to see how age-dependency plays a role in such a framework.
    • The paper would benefit from a more exhaustive analysis.
    • The authors claim that deep learning reaches lower performance than the hand-crafted features with logistic regression. However, the method used was proposed in a recently submitted preprint (February 15, 2022). It goes without saying that, given the date, this probably goes against the double-blind submission process. The performance of this network is below the current state of the art for AD classification; therefore, the discussion might not apply to more advanced deep-learning methods.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments and results detailed in this paper are not sufficient to warrant presentation at MICCAI.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors analyze the robustness of two different MRI volume-based classifiers to distribution shifts. Both classifiers are trained to detect Alzheimer’s disease, based on different feature representations. The first is a logistic regression model that uses manually selected volumetric features as inputs, which are obtained using FreeSurfer and SPM. The second is a CNN using the full 3D MRI volumes as inputs. They analyze the effect of differing training dataset sex compositions on the performance for male and female test subjects.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Analysis by gender in the classification of subjects with AD/HC, pMCI/sMCI.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The volume of brain substructures can vary from subject to subject; it is not considered a good feature.
    • During CNN training, the dimensions are normalized, which can compromise the shape of the brain.
    • They do not mention anything about age, which is important in the study of dementia.
    • CNN and logistic regression are already widely used algorithms.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The database is known http://www.adni-info.org/ , they added implementation details and the code is available on github.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I suggest adding the main quantitative results to the abstract.
    • Define the abbreviations MRI and HC before Table 1.
    • Review the format of table 1
    • The validation and final test sets are not clear; I suggest adding a table with the subsets used in each validation instead of Figure 2.
    • I recommend considering the age of the study subjects as a covariate. Brain morphology can change according to age.
    • One of the challenges in the classification of AD or MCI is the automatic classification between MCI and HC; please justify why this test was not performed.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • CNN and logistic regression are already widely used algorithms. There is no contribution regarding the method.
    • Age is not considered, which is important.
    • One of the main challenges is the classification between MCI and HC, which was not carried out.
  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    In this work the authors evaluate the effects of sex imbalances on the performance of two classifiers (logistic regression and CNN) in the context of Alzheimer’s disease diagnosis/prognosis. They trained the two classifiers with several training sets more or less imbalanced and showed that, in contrast with other domains such as lung disease diagnosis from x-ray, sex differences do not seem to affect the results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem is important, as it has been shown in other contexts that sex differences can bias classifiers, and many works focus on computer-aided diagnosis/prognosis of Alzheimer’s disease.
    • The experimental setup seems sound and well designed to answer the proposed question.
    • The results seem solid and well explained.
    • The paper is clear and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The work focuses on ADNI, a database that has been extensively used and includes peculiarities, such as the fact that there are more men than women while the prevalence of AD is generally higher for women than men. Extending the proposed work to other data sets would strengthen its impact.
    • The discussion/conclusion could remind readers that the training task is limited to AD vs CN and that the conclusion reached might not hold when the training task is pMCI vs sMCI (for example, because women tend to progress faster than men).
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The answers given by the authors seem consistent with what is present in the paper. Code will be made available on GitHub, and the methods and results are well described for an 8-page paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The authors could mention why they mix FreeSurfer and SPM to obtain regional volumes normalised to ICV (I assume this is related to the better consistency in ICV estimation with SPM?).
    • Could the authors explain why the classification performance is higher for women than men, even when no women were part of the training set?
    • Even though nothing is significant, it seems that the CNN is more affected by sex differences than the logistic regression, could you comment?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-designed and well-described study with an important message for a community that works extensively on the computer-aided diagnosis/prognosis of AD with ML/DL.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The contribution of this paper is not to be found in a novel methodology but in the analysis of the limitations of widely-used methods and data sets. This message is important and should be heard by the MICCAI community.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewer comments are quite mixed, so I recommend a rebuttal.

    Strengths: Important problem of sex bias, good experimental setup, solid and well-explained results, well-written paper.

    Weaknesses: Limited methodological contribution, it is not clear how/if the authors accounted for age, only one dataset (ADNI) is used, limited task (only CN/AD).

    Please address in the rebuttal: the methodological novelty of the paper, the role of age bias, and how results would extend to other tasks and datasets.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank the reviewers for their careful assessment and valuable feedback. The reviewers appreciate the critical issue of potential biases resulting from dataset composition and assess our experimental setup as solid.

=== Novelty and methodological contribution === The main objective of our work is to investigate the relationship between dataset composition and the performance of the resulting trained classifiers. This relationship has recently received much attention, with many studies reporting improved performance on groups that are over-represented in the training dataset. In our study, as opposed to previous studies on, e.g., chest x-ray disease classification, we do not find a strong effect of dataset composition on the performance of the resulting classifiers. This result is especially relevant since the dataset we examine is one of the most widely used public datasets in the MICCAI community.

A key message of our work is that the link between dataset representation and classifier performance on different groups is not universal and depends on the specific prediction task and feature representation at hand. With our contribution, we aim to raise awareness of this issue in the medical imaging community and spur further research towards deeper theoretical understanding and practical solution approaches.

=== Datasets === In addition to the singular importance of ADNI for the MICCAI community, our paper explicitly seeks to compare the sensitivity of the very flexible CNNs to the far more robust logistic regression model. Alzheimer’s disease is a great example as both types of models are currently considered state of the art. This does not hold for most other medical imaging tasks, and among Alzheimer’s disease databases, ADNI gives an advantage in its size.

=== Age effect === The reviewers’ question concerning the role of age in our analysis is very valid, which is why we included age as a covariate in our logistic regression model (Eq. 1). To further investigate the role of age, we repeated the experiment described in our paper, but grouped subjects based on age (above or below the median age of 73 years) instead of sex. In this initial experiment, we indeed observe a small but statistically significant effect of dataset representation on classifier performance, and we are happy to include these additional supplementary results in the final manuscript.

This result lends further support to our key message that the influence of dataset representation on classifier performance depends on the specific groups and tasks under consideration. While classifiers trained on different sexes transfer well to the other group, classifiers trained on older subjects perform worse when used in younger subjects.
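The kind of composition experiment discussed in this rebuttal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `assign_age_group` and `sample_training_set`, the subject record layout, the toy cohort, and the chosen ratios are all hypothetical. Only the idea of splitting at the median age and varying the group proportions in the training set is taken from the text above.

```python
import random
from statistics import median

def assign_age_group(subjects):
    """Label each subject 'older' or 'younger' relative to the cohort median age.

    Assumption: subjects exactly at the median are counted as 'younger'.
    """
    med = median(s["age"] for s in subjects)
    return [dict(s, group="older" if s["age"] > med else "younger") for s in subjects]

def sample_training_set(subjects, group_key, group_a, frac_a, size, seed=0):
    """Draw `size` distinct subjects so that a fraction `frac_a` belongs to `group_a`."""
    rng = random.Random(seed)
    in_a = [s for s in subjects if s[group_key] == group_a]
    not_a = [s for s in subjects if s[group_key] != group_a]
    n_a = round(frac_a * size)
    if n_a > len(in_a) or size - n_a > len(not_a):
        raise ValueError("not enough subjects for the requested composition")
    return rng.sample(in_a, n_a) + rng.sample(not_a, size - n_a)

# Hypothetical toy cohort: 40 subjects aged 60 to 99.
cohort = assign_age_group([{"id": i, "age": 60 + i} for i in range(40)])

# Fixed training set size, varying share of the 'older' group.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    train = sample_training_set(cohort, "group", "older", frac, size=20)
    n_older = sum(s["group"] == "older" for s in train)
    print(f"requested fraction {frac:.2f}: {n_older} of {len(train)} subjects are older")
```

Keeping the training set size fixed while varying only the group ratio separates the effect of composition from the effect of sample size. The same helper works for sex-based grouping by passing a different `group_key`.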

=== Are our methods state of the art? === Wen et al. (2020) provide an extensive review of the state of the art in MRI-based AD detection. Our CNN architecture is very similar to theirs and the one proposed by Tinauer et al. (2022), and all our processing steps (including registration to atlas space) are standard in the field of neuroimaging. Our test set performance (mean CNN accuracy 0.80/0.78 for females/males) is within the range reported by Wen et al. for studies without suspected data leakage. The (valid) studies that report higher performance use significantly more recordings or multiple modalities, and they only report the performance of a single training run. (Some of our runs also performed significantly better than the mean accuracy reported above, cf. Fig 3.) Thus, we are confident that our analyses can provide insights into the properties of state-of-the-art methods.
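For readers unfamiliar with the reporting convention used here (mean accuracy over repeated training runs, stratified by group), a minimal sketch is given below. The helper names `group_accuracy` and `mean_group_accuracy` and the toy labels are assumptions for illustration only; they are not taken from the paper or its repository.

```python
from collections import defaultdict
from statistics import mean

def group_accuracy(y_true, y_pred, groups):
    """Return the accuracy of one run separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def mean_group_accuracy(runs):
    """Average the per-group accuracy over several runs.

    `runs` is an iterable of (y_true, y_pred, groups) triples, one per training run.
    """
    per_group = defaultdict(list)
    for y_true, y_pred, groups in runs:
        for g, acc in group_accuracy(y_true, y_pred, groups).items():
            per_group[g].append(acc)
    return {g: mean(accs) for g, accs in per_group.items()}

# Hypothetical toy data: two runs evaluated on the same four test subjects.
groups = ["F", "F", "M", "M"]
runs = [
    ([1, 0, 1, 0], [1, 0, 0, 0], groups),
    ([1, 0, 1, 0], [1, 1, 1, 0], groups),
]
print(mean_group_accuracy(runs))
```

Reporting the mean over runs, rather than a single run, makes the spread between individual runs visible; this is the distinction the rebuttal draws with studies that report only one training run.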

=== Preprint === We do not have a February 15 preprint, as suggested by one reviewer. Note, however, that arXiv preprints are allowed by the submission guidelines.

=== Further comments === We thank the reviewers for their detailed further comments, which we will also address in the final manuscript.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with Reviewer 4 that the contribution of this paper lies in the analysis of the limitations of widely used methods and data sets, and that this is an important message. Unfortunately, the negative reviewers have not updated their reviews after the rebuttal. Their comments are well addressed and the novel contribution of the work is clear.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, an important, refreshing report to read. Adding the results of the age-related experiment will further enhance the impact of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Important problem, all in all I follow the authors in their rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    upper mid-range


