Authors
Yi Hao Chan, Wei Chee Yew, Jagath C. Rajapakse
Abstract
Computational models often overfit on neuroimaging datasets (which are high-dimensional and consist of small sample sizes), resulting in poor inferences such as ungeneralisable biomarkers. One solution is to pool datasets (of similar disorders) from other sites to augment the small dataset, but such efforts have to handle variations introduced by site effects and inconsistent labelling. To overcome these issues, we propose an encoder-decoder-classifier architecture that combines semi-supervised learning with harmonisation of data across sites. The architecture is trained end-to-end via a novel multi-objective loss function. Using the architecture on multi-site fMRI datasets such as ADHD-200 and ABIDE, we obtained significant improvement on classification performance and showed how site-invariant biomarkers were disambiguated from site-specific ones. Our findings demonstrate the importance of accounting for both site effects and labelling inconsistencies when combining datasets from multiple sites to overcome the paucity of data. With the proliferation of neuroimaging research conducted on retrospectively aggregated datasets, our architecture offers a solution to handle site differences and labelling inconsistencies in such datasets. Code is available at https://github.com/SCSE-Biomedical-Computing-Group/SHRED.
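As a rough illustration of the described setup, the following is a minimal, hypothetical PyTorch sketch of an encoder-decoder-classifier trained with a combined reconstruction + classification loss. The layer sizes, loss weights, and the label convention (y = -1 for unlabelled scans) are illustrative assumptions, not the actual SHRED implementation (see the linked repository for that).

```python
# Minimal sketch: encoder-decoder-classifier with a multi-objective loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoderClassifier(nn.Module):
    def __init__(self, in_dim=34716, latent_dim=32, n_classes=2):
        # in_dim is illustrative, e.g. the upper triangle of a 264-ROI FC matrix.
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)

def multi_objective_loss(x, y, x_hat, logits, w_rec=1.0, w_cls=1.0):
    """Reconstruction on all scans; classification only on labelled ones (y >= 0)."""
    rec = F.mse_loss(x_hat, x)
    labelled = y >= 0
    cls = (F.cross_entropy(logits[labelled], y[labelled])
           if labelled.any() else x.new_zeros(()))
    return w_rec * rec + w_cls * cls
```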
Link to paper
DOI: https://link.springer.com/chapter/10.1007/978-3-031-16431-6_42
SharedIt: https://rdcu.be/cVD6Y
Link to the code repository
https://github.com/SCSE-Biomedical-Computing-Group/SHRED
Link to the dataset(s)
http://preprocessed-connectomes-project.org/abide/
http://preprocessed-connectomes-project.org/adhd200/
Reviews
Review #2
- Please describe the contribution of the paper
This paper proposes a deep-learning framework that includes data harmonisation and semi-supervised representation learning for disease classification and biomarker discovery by using data from multiple sites.
Their reported performance on two datasets is impressive compared to the existing work in the literature.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well written and easy to follow, making its contributions clear.
The performance on two public datasets is improved by a large margin.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The technical improvement is marginal, as it mainly exploits existing mechanisms.
The paper needs to survey recent work on multi-site brain disease diagnosis more thoroughly.
The reported performance is not persuasive, given the large difference from results in the existing work.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
NA
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
The paper missed many recent studies on multi-site brain disease diagnosis, including population-based graph neural networks, domain adaptation, and domain generalization methods.
The proposed method applies the harmonization operation to the FC values, obtained via Pearson correlation, rather than to the raw BOLD signals. I wonder how the performance would change if harmonization were applied to the BOLD signals first, with the respective FCs then constructed and fed into the EDC. It would be interesting to compare the two from different viewpoints.
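For concreteness, here is a small numpy sketch of the ordering being contrasted; `harmonise_bold` is purely hypothetical and only marks where signal-level harmonisation would slot in.

```python
# FC construction from BOLD time series via Pearson correlation.
import numpy as np

def bold_to_fc(bold):
    """bold: (n_rois, n_timepoints) array -> (n_rois, n_rois) Pearson FC."""
    return np.corrcoef(bold)

bold = np.random.randn(264, 200)   # e.g. 264 ROIs, 200 timepoints
fc = bold_to_fc(bold)              # paper's route: harmonise these FC values
# Reviewer's alternative: fc_alt = bold_to_fc(harmonise_bold(bold)),
# where harmonise_bold is a hypothetical signal-level harmonisation step.
```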
Regarding model training, no loss term involves the site-related parameters of $\gamma_{iv}$ and $\delta_{iv}$. How can those parameters be optimized?
The authors raised an issue of inconsistency in diagnostic criteria among sites. However, the proposed method doesn’t handle the issue at all.
A critical concern about the performance is that the reported values are too high compared to existing work on the same dataset (e.g., ABIDE) in the literature. In particular, the accuracy of the competing method ASD-SAENet is approximately 10% higher than that reported in the original paper. How can the authors explain this?
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
3
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Lack of survey of the related recent work; no loss term for the site-related parameters $\gamma_{iv}$ and $\delta_{iv}$; inconsistency between the performance of the competing method and the original work.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
2
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
3
- [Post rebuttal] Please justify your decision
Not Answered
Review #4
- Please describe the contribution of the paper
- combined harmonization and classification framework
- demonstrates the ability to determine generalizable biomarkers
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Large multi-site cohorts
- Multiple diseases
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The effect of augmenting datasets with data from other sites is not fully investigated
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Reproducibility is good
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
With the goal of helping smaller studies augment their data sets, I think it is imperative to show the stability of the determined site-invariant biomarkers. A simple leave-N-sites-out approach should be able to do this.
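A hedged sketch of the leave-one-site-out split underlying this suggestion (leave-N-sites-out generalises it); the function name and setup are illustrative, not from the paper's code.

```python
# Yield train/test index pairs, holding out one site at a time.
import numpy as np

def leave_one_site_out(site_labels):
    site_labels = np.asarray(site_labels)
    for site in np.unique(site_labels):
        test = np.where(site_labels == site)[0]
        train = np.where(site_labels != site)[0]
        yield train, test

# Biomarker stability could then be assessed by comparing the saliency-derived
# biomarkers across the folds produced by this generator.
```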
It would also be good to see how much training data actually went into the individual models.
I also spotted a wrong highlight in Table 1 for USM. Maybe use only one digit after the decimal point, as this table was pretty hard to read in some areas.
I would not call putting the code online, although extremely important, a major contribution in the introduction.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While I asked for some additional highlights, I think this is a great paper to be discussed by the community. Data harmonization is very important and it is good to see a combined approach including classifiers.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
1
- Reviewer confidence
Somewhat Confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Not Answered
- [Post rebuttal] Please justify your decision
Not Answered
Review #3
- Please describe the contribution of the paper
This paper presents a semi-supervised learning (SSL) method to harmonize data across imaging sites while simultaneously learning a classification task. The proposed variational autoencoder (VAE) model has a data harmonization encoder and decoder, with the latent representation used to learn the target classification task. The model is trained in a semi-supervised way, where unlabelled data from multiple other sites are used to learn the data harmonization encoding/decoding, while labelled data is additionally used to learn the classification portion of the model. A two-step harmonization approach was also proposed, where the ComBat method was used for initial harmonization, followed by the proposed method. The method was tested against supervised learning on single sites and SSL without harmonization using the public ABIDE dataset.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper aims to solve a very important problem in neuroimaging studies - data harmonization across imaging sites - which would allow for combining different datasets for more generalizable analysis and thus learning neuroimaging biomarkers that truly represent the disorder/disease under study and are not specific to a single imaging site/study.
- The novelty in this work lies in the incorporation of a linear data harmonization module in the encoder and decoder of a VAE model.
- The authors provide a link to the code and test on a public dataset, enhancing reproducibility.
- The experimental validation methodology is thorough, with 5-fold cross-validation performed with 10 random starts, and hyperparameters optimized using a separate validation dataset and set before running all the test experiments.
- The paper is generally well-organized and easy to follow.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The motivation for the proposed semi-supervised learning approach was that there may be label inconsistency issues across sites. However, I'm not sure that this is a real concern in the presented ASD classification case - the diagnosis of ASD is largely objective and follows clinically well-defined parameters. Still, there should certainly be other applications where label uncertainty could be an issue.
- The proposed data harmonization model does not consider the target label (ASD/typical control). I am wondering if there is a concern then that the two neurologically different groups are being harmonized to one set of parameters. For example, perhaps activation in social motivation areas is greatly reduced in ASD subjects compared to controls, but the same $\alpha_v$ and $\beta_v$ are learned for both groups.
- The experimental methods do not include any comparisons to other data harmonization approaches. At a minimum, I would have expected to see a comparison to a method that first runs ComBat on all the data and then uses the harmonized output to perform supervised learning for each site individually. Another potential approach for the general domain shift problem is something like a generative adversarial network that tries to learn a representation invariant to site and other factors (e.g., [1-3]).
[1] Bashyam et al., Deep Generative Medical Image Harmonization for Improving Cross-Site Generalization in Deep Learning Predictors, 2021. [2] Gao et al., A universal intensity standardization method based on a many-to-one weak-paired cycle generative adversarial network for magnetic resonance images, 2019. [3] Liu et al., Style Transfer Using Generative Adversarial Networks for Multi-site MRI Harmonization, 2021.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The authors have checked off all relevant items on the reproducibility checklist, including sharing of code, and the reporting in the paper matches the checklist, so should be highly reproducible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
Comments are listed in order of appearance in the paper. The most important concerns are marked with (M).
- The paper states "Unlike more complicated alternatives like ComBat, our approach of removing site differences allows biomarkers to be easily derived via computing saliency scores [10] since the implementation is based on linear layers." The proposed generalized linear model for harmonization is the same as that used in ComBat, but ComBat estimates the parameters in a different way. Thus I am not sure what is meant by the quoted statement: after the resting-state fMRI data is normalized with ComBat and then applied to, say, a classification task, feature importance could still be attributed to the proper ROIs using the normalized data.
- In the data harmonization definition in eq. 1, should $M_{jv}$ be $M_{iv}$? The design matrix for covariates of interest (e.g. gender, age) should only depend on the specific subject (i.e. subject j at site i).
- In eqs. 4-6, it seems that the subscript $i$ has disappeared, but it is needed to denote all the subjects from different sites.
- In the Sec. 2.4 motivation for SHRED-II, where ComBat is applied before using the proposed model, the sentence "For very small datasets (< 50), we propose a two-step variant of SHRED" makes it sound like ComBat is only applied in the small-dataset cases in the experiments. However, I think the intended meaning is that the two-step variant is proposed to further improve prediction in small-dataset cases. Please clarify/reword to make this clearer.
- (M) For training of all the deep learning models, how many epochs were run / what criterion was used to determine when to stop training?
- The loss appears to be largely dominated by the harmonization reconstruction term, whose hyperparameter is many orders of magnitude higher than the other loss hyperparameters. I'm wondering how much of an effect the VAE parts of the loss have then, i.e., how important is the VAE modeling compared to simply the shared representation and the included data harmonization?
- (M) To reiterate the point in question 5, it would be helpful to see comparisons to other data harmonization approaches, e.g., applying ComBat and then single-site training (a sketch of this baseline follows the list). I am also curious how the results would look if the proposed model were applied to single-site data, so that there would be harmonization of age and gender factors. This could further demonstrate the advantage of being able to include more data in the semi-supervised learning approach (if this performs better).
- For the supervised and semi-supervised (without harmonization) methods, it appears that the extra data used for harmonization (age, gender, site) is not included in the model. This is not exactly a fair comparison, and goes back to the point above about needing to include other harmonization comparisons, or at least to compare other methods that also consider age, gender, and site (e.g., as inputs to the DNN).
- For the ASD-SAENet results, how might the authors explain the seemingly much higher performance reported here than in the original paper proposing the approach, which also used the ABIDE individual sites for testing?
- Table 1 is a bit hard to read - consider adding more space between columns or vertical lines to better separate values.
- Some noted typos: p. 5 "constraints" -> "constrains".
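Regarding the ComBat-then-single-site baseline requested above, a rough Python sketch of its shape follows; `combat_stub` is only a crude per-site z-scoring stand-in, and a real comparison would use an actual ComBat implementation (e.g. the neuroCombat package).

```python
# Hypothetical baseline: harmonise all features first, then train per site.
import numpy as np

def combat_stub(x, sites):
    """Crude per-site location/scale removal; a stand-in for real ComBat."""
    x = np.asarray(x, dtype=float).copy()
    sites = np.asarray(sites)
    for s in np.unique(sites):
        idx = sites == s
        x[idx] = (x[idx] - x[idx].mean(axis=0)) / (x[idx].std(axis=0) + 1e-8)
    return x

# x_harmonised = combat_stub(fc_features, site_labels)
# ...then fit one supervised classifier per site on x_harmonised.
```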
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
To help solve the important problem of data harmonization across imaging sites, the proposed method nicely incorporates the harmonization into a VAE model that also jointly learns the target classification task. However, I have concerns regarding the assumptions of the harmonization model for handling neurologically different groups, and very importantly the paper does not present any comparisons to other data harmonization techniques.
- Number of papers in your stack
5
- What is the ranking of this paper in your review stack?
2
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
5
- [Post rebuttal] Please justify your decision
While the authors responded to many reviewer concerns, my rating stays the same (close to borderline), as I found the response to the concerns about missing comparisons to other harmonization approaches unsatisfying: there are domain generalization methods that could be considered, and even more straightforward would be harmonizing with ComBat before supervised learning, as has been done before. Still, I think the methods would be of interest.
Review #5
- Please describe the contribution of the paper
This paper proposes a semi-supervised SHRED framework based on an encoder-decoder architecture to remove site differences as well as label inconsistency for diagnosis with rs-fMRI. SHRED improves diagnosis accuracy, and the biomarkers it generates can be disambiguated into site-invariant and site-specific ones.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
a) Data harmonisation is a very important problem for medical imaging applications, for which this work proposes a feasible solution. b) Incorporating semi-supervised learning can take full advantage of all the data. c) The results seem interesting, especially the site-invariant and site-specific biomarkers.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
a) In Table 1, EDC-VAE has better performance than SHRED on 9/16 sites of the ABIDE data. It seems that the harmonisation method proposed in this work may not help to exclude the site differences. b) The capability to address the label inconsistency is not supported by the experiments; I think adding noise to the labels would be a more appropriate test. The semi-supervised learning actually omits the influence of inconsistency by using only a part of the data. c) No references are given for the compared methods ASD-SAENet and GAE-FCNN. It seems that the two compared SL methods even perform much worse than a simple DNN, which is strange. I think the authors should include some SOTA models for the ADHD or ABIDE datasets. d) The ablation study should consider the influence of age and gender. For example, I think the SL-based methods do not incorporate age and gender in classification.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
I think this work is reproducible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
a) - d) See Q5. e) Some arguments are not rigorous. For example, in the Introduction section: "Little research has been done on using SSL for neuroimaging data". As far as I know, SSL has been popular and well studied in the medical imaging field in recent years, and many SSL papers have been accepted at MICCAI. f) There are many grammar errors, for example, "their performance do" -> "does". g) The matrix M should be in bold font. h) In Fig. 2, "SL, SHRED" should be "SSL, SHRED".
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper works on an interesting and important problem and proposes a feasible solution. However, I think the major concerns should be addressed.
- Number of papers in your stack
7
- What is the ranking of this paper in your review stack?
3
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
5
- [Post rebuttal] Please justify your decision
Some of my concerns have been addressed.
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
3 out of 4 reviewers rate this work positively, the main strengths being
- significant improvement over baselines
- targeting an important problem (data harmonization).
However, reviewers are concerned about
- missing discussion of/comparison to recent studies on domain adaptation
- inconsistencies in the results compared with existing work
- lacking clarity in some aspects, e.g., the relation of the method to the issue of label inconsistency
Due to the promising results and the importance of the task, I would invite the authors to address/clarify the mentioned points in the rebuttal.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
4
Author Feedback
- Comparison with recent domain adaptation studies (R2,R3,R4)
- We have looked into papers on domain adaptation but did not include them, as they focus on mapping one site's data to another site (i.e. limited to site-specific biomarkers). Instead, we focused our literature review on methods that remove site effects (e.g. ComBat), since site-invariant biomarkers, a key interest in this study, can be produced from them.
- A paper cited in the introduction [11] has shown on ABIDE that ComBat improves model performance in the SL setting. Results from [11] were not included in Table 1 as they were only reported as a bar chart. From visual inspection, SSL + Harmonisation leads to even greater improvements than [11].
- Age, gender and label effects during harmonisation (R3,R5)
- (R5) Although age and gender were used as inputs to SHRED, the data fed into the VAE (x_res) only had site effects removed: covariates linked to age and gender ($M\beta$) were actually added back (page 3, last line). This is also consistent with [11]. This makes the comparisons between SL and SSL + Harmonisation fair, as the only difference is SSL + site-effect removal.
- (R3) We agree that labels could be considered during harmonisation, but this would not be possible without correcting label inconsistencies first. Furthermore, it would prevent models from being applied to unlabelled data for disease classification.
- Relation of the method to the issue of label inconsistency (R2,R3,R5)
- (R3) In ABIDE I [9], 13 sites used clinical judgement along with the gold standard, while others used gold standards or clinical judgement only. There could be differences in clinical judgement, warranting the need to deal with label inconsistency even if the diagnosis of ASD is usually objective.
- Label inconsistency was addressed via our SSL approach, i.e. drop labels when using data from other sites, thus only using them for unsupervised learning. This indeed does not directly correct the inconsistency - rather, it removes the problem by not using labels. However, such an approach has already achieved very good results on individual sites (Table 1, ~80%-90%). Better approaches could be investigated, but would likely lead to incremental improvements from the high baseline set by SHRED.
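A minimal sketch of this label-dropping strategy, assuming a sentinel value for unlabelled scans; the function name and convention are illustrative, not taken from the SHRED code.

```python
# Scans from other sites keep their features but lose their labels, so they
# contribute only to the unsupervised (reconstruction) part of the loss.
import numpy as np

def drop_other_site_labels(y, sites, target_site, missing=-1):
    """Mark every scan outside target_site as unlabelled."""
    y = np.asarray(y).copy()
    y[np.asarray(sites) != target_site] = missing
    return y
```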
- Clarity about model training process (R2,R3)
- (R2) Although $\gamma$ and $\delta$ are not explicitly used in the loss functions, they are optimised via backpropagation of the $\epsilon^2$ term in $L_R$. As $L_R$ minimises $\epsilon^2$ and $\epsilon_{ijv} = (x_{ijv} - \alpha_v - M_{jv}\beta_v - \gamma_{iv}) / \delta_{iv}$, $\gamma_{iv}$ needs to be as close as possible to $x_{ijv} - \alpha_v - M_{jv}\beta_v$ and $\delta_{iv}$ needs to be as large as possible (a sketch of this gradient path follows this list).
- (R3) 1000 epochs were used for all models, and test accuracy was used as the stopping criterion.
- (R3) The loss functions involve summation across all scans, which already covers all sites in the process; thus the subscript $i$ was omitted from them.
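The gradient path described in the first point above can be illustrated with a hedged PyTorch sketch; the `log_delta` parameterisation (keeping the scale positive) and all names are illustrative assumptions rather than the SHRED source.

```python
# Site parameters receive gradients through eps even without a dedicated loss term.
import torch
import torch.nn as nn

class SiteHarmoniser(nn.Module):
    def __init__(self, n_sites, n_features):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_features))                # grand mean
        self.gamma = nn.Parameter(torch.zeros(n_sites, n_features))      # site shift
        self.log_delta = nn.Parameter(torch.zeros(n_sites, n_features))  # site scale

    def forward(self, x, covariate_effect, site_idx):
        # eps = (x - alpha - M*beta - gamma_i) / delta_i
        delta = self.log_delta[site_idx].exp()
        return (x - self.alpha - covariate_effect - self.gamma[site_idx]) / delta

# Minimising the eps^2 term inside L_R, e.g. loss = (eps ** 2).mean(), then
# backpropagates into alpha, gamma and delta automatically.
```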
- Inconsistencies in the result (with existing work) (R2,R3,R5)
- The difference between results reported in ASD-SAENet [2] and our implementation (using their code and parameters) is likely due to the different atlases used. In [2], 200 ROIs from the Craddock atlas were used, while we used 264 ROIs from the Power atlas. Better performance in our implementation of [2] could suggest that ROIs from Power atlas are more informative for ASD classification and biomarker discovery.
- (R5) SHRED, being end-to-end trainable, works well on bigger labelled datasets while SHRED-II works well for small ones. For these 9/16 cases, the initial amount of data is very small and the results (SHRED-II is better than EDC-VAE in most of these cases) suggest that SHRED-II should be used in such cases.
- (R2,R5) The large difference in performance between SL and SSL could be due to the size of data involved. SL settings mostly have very little data (<50, shown in Fig S1) but in SSL, we used data from all sites (>500 samples). The significantly better results from using SSL + Harmonisation demonstrate its value.
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Even though the rebuttal could have been more extensive, I concur with the majority of the reviewers in favor of acceptance of this paper.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
5
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Reviewers agree that this paper addresses an important topic, and the overall feedback was positive to begin with. The rebuttal provides detailed answers on the technical questions that were raised. A remaining concern is the somewhat limited scope of the literature survey, but I would argue that a limitation to the most important references is an unavoidable consequence of the conference paper format and MICCAI page limit.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
7
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The proposed harmonization method presents novelty and contributes to the related problem. At the same time, the key idea, a semi-supervised way to address the heterogeneous diagnostic attributes across datasets, should be further clarified in the final manuscript. Discussion of the comparisons needs improvement as well; for example, having different atlases and cohorts would not make up for the large discrepancies in the performance comparison.
- After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.
Accept
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
9