Authors
Leonie Henschel, David Kügler, Derek S Andrews, Christine W Nordahl, Martin Reuter
Abstract
Exploration of bias has significant impact on the transparency and applicability of deep learning pipelines in medical settings, yet is so far woefully understudied.
In this paper, we consider two separate groups for which training data is only available at differing image resolutions. For group H, available images and labels are at the preferred high resolution while for group L only deprecated lower resolution data exist. We analyse how this resolution-bias in the data distribution propagates to systematically biased predictions for group L at higher resolutions. Our results demonstrate that single-resolution training settings result in significant loss of volumetric group differences that translate to erroneous segmentations as measured by DSC and subsequent classification failures on the low resolution group. We further explore how training data across resolutions can be used to combat this systematic bias. Specifically, we investigate the effect of image resampling, scale augmentation and resolution independence and demonstrate that biases can effectively be reduced with multi-resolution approaches.
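As a rough illustration of the scale-augmentation strategy mentioned in the abstract (a hypothetical sketch; the paper's actual scale range and interpolation choices are not given here), an image/label pair can be randomly rescaled during training to simulate other voxel sizes:

```python
import numpy as np

def scale_augment(image: np.ndarray, label: np.ndarray,
                  scale_range=(0.8, 1.15), rng=None):
    """Randomly rescale an image/label pair to mimic other voxel sizes.

    Hypothetical sketch: nearest-neighbour resampling is used for both
    arrays so that label ids stay intact; real pipelines typically use
    linear interpolation for the image intensities.
    """
    rng = rng or np.random.default_rng()
    s = rng.uniform(*scale_range)
    new_shape = tuple(max(1, int(round(d * s))) for d in image.shape)
    # Map every output voxel back to its nearest source voxel.
    idx = np.meshgrid(
        *[np.minimum((np.arange(n) / s).astype(int), d - 1)
          for n, d in zip(new_shape, image.shape)],
        indexing="ij",
    )
    return image[tuple(idx)], label[tuple(idx)]
```

Exposing the network to many effective voxel sizes this way is one of the multi-resolution strategies the abstract compares against resolution-independent architectures.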
Link to paper
DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_34
SharedIt: https://rdcu.be/cVRyO
Link to the code repository
https://github.com/Deep-MI/FastSurfer
Link to the dataset(s)
N/A
Reviews
Review #1
- Please describe the contribution of the paper
This paper describes an analysis of resolution bias in segmentation networks and explores ways to reduce that bias. To analyze how the resolution of the training data limits segmentation performance, a comparison experiment was performed using four approaches. The paper shows that single-resolution networks fail to generalize across resolutions, but that scale augmentation and resolution-independent network structures help reduce the bias.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The experiments and analysis are well directed at the problem caused by resolution bias and its possible solutions. The content of the experiments and the analysis also seems appropriate. This work should provide useful information for deep-learning-based segmentation research.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
For further analysis, two additional approaches (drop Group H, and downscale Group H) could be helpful to examine the systematic bias due to image resolution.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Not applicable
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
Additional experiments (drop Group H, and downscale Group H) could be helpful to examine the systematic bias due to image resolution.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
7
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Experiments and analysis have been done well.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
1
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #4
- Please describe the contribution of the paper
This work illustrates how the resolution bias in the data distribution propagates to the output prediction. The authors compare how different strategies, such as input resampling and scale augmentation, perform in comparison with resolution-aware architectures, and demonstrate that the latter approaches reduce such bias more effectively.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths
- This paper explores a very relevant problem of image resolution bias in DL segmentation
- Usage of publicly available datasets and assessment on two different tasks (cortical segmentation in adults and children, and hippocampus segmentation)
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Unclear value of the down-stream task classification
- Statistical significance of the results
- 2.5D approach
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The statement of the authors for reproducibility does not correspond with the paper. There is no mention of code availability, no information on training time/cost, and no hyperparameter optimisation settings. It does not seem that this study can be reproduced given this lack of detail. The methods are very vaguely described.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
The authors illustrate the problem of image resolution bias in two different tasks. I found the paper extremely well written, and it conveys an important message for the MIC community. The paper, however, lacks details on the methods and the imaging settings used.
My concern regarding this work relates to the first experiment. What is the rationale for mixing adult and children's data for cortical segmentation? It is not specified which image contrast is used for that task. Children below 9 months may have quite different MR contrast (T1 and T2 are swapped).
Why not explore the issue on either children or adults only?
I think there is an error in the 80-scan split (adults/children), as 50/10/30 sums to 90?
On top of scale augmentation, which other augmentations were used in this scenario? Gaussian noise?
Which exact interpolation was used for method c)? Would super-resolution techniques instead of interpolation do a better job?
I acknowledge the value of adding a downstream task. But the rationale for the task of classifying adults vs. children based on cortical GM is not clear to me. Were those measures normalized by intracranial volume? It would have been more interesting to explore this task on a patient vs. healthy-control population, for instance (AD vs. controls).
I did not understand why adults overall seem unaffected by the training/network strategy in the results; is this resolution bias then only an issue depending on the targeted volume to be segmented?
I do not understand how, in the classification task (Table 1, first column), classification suddenly improves up to 0.89 at the lowest resolution. But overall, I found this task not particularly pertinent for exploring how the bias propagates further.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, I found the paper and the topic very interesting, and I think it deserves to be discussed during the conference. However, the lack of information on the methodology and, most importantly, the choice of some experiments could be improved. I understand, though, the limited space.
- Number of papers in your stack
5
- What is the ranking of this paper in your review stack?
2
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #5
- Please describe the contribution of the paper
The authors present an analysis of how volumetric bias from networks trained on image data acquired at one resolution influences segmentation results on data acquired at a second resolution. To measure the volume bias, a normalized volume difference is computed between the ground-truth segmentation and the predicted segmentation mask. For models and data, the authors test U-Nets and voxel-size-independent neural networks (VINNs), trained with a variety of resampling strategies, on MRI images acquired from adults and children to provide a resolution challenge, as well as on hippocampal volume segmentation. Volume bias is examined by looking at the distribution of segmented volumes, Dice similarity, and the volume bias of the predictions. Overall, the authors found that building scale-invariant networks, or using resolution augmentation, can reduce volume biases.
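The normalized volume difference described here can be sketched as follows (a minimal illustration; the exact normalization used in the paper may differ, e.g. normalizing by the mean of both volumes):

```python
import numpy as np

def volume_difference(gt_mask: np.ndarray, pred_mask: np.ndarray,
                      voxel_volume: float = 1.0) -> float:
    """Signed volume difference of prediction vs. ground truth,
    normalized by the ground-truth volume.
    Positive values indicate over-segmentation."""
    vol_gt = float(gt_mask.sum()) * voxel_volume
    vol_pred = float(pred_mask.sum()) * voxel_volume
    return (vol_pred - vol_gt) / vol_gt
```

Averaged over a group, a mean far from zero signals a systematic (directional) error, which overlap scores such as Dice cannot reveal on their own.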
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strength of this paper is using the volume bias to check for the generalizability of trained models from data acquired at one resolution to data acquired at another. This metric provides a better method of investigating differences resulting from scale variance, and can highlight which augmentation and network setup leads to the least bias.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The methods presented are not novel and have been used before. The models used are all presented in other papers cited by the authors. The use of volumetric bias to evaluate hippocampal segmentation in children and adolescents has been performed before by Herten et al. in their 2018 paper, “Accuracy and bias of automatic hippocampal segmentation in children and adolescents”.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The authors' results should be easily reproducible, as their models are taken from other papers and their datasets are accessible elsewhere. To improve reproducibility, the authors should make the images in each of their splits available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- When training the models, I would have used early stopping rather than the fixed number of epochs the authors used; a fixed epoch count could lead to either underfitting or overfitting. This is especially important as the authors used different architectures for their models.
- It would be beneficial to include visual results of the model performance (and the bias on low-res data).
- It would be nice to assess the correlations between Dice, Hausdorff, etc. and the volume bias.
- There are several open-source (SOTA) CNN-based hippocampal segmentation models available; the paper, and the field in general, would strongly benefit from comparing them to your augmentation or VINN models.
- The figure labels are a bit crowded.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The main reason for recommending this paper is that this application of existing bias measurement methods to scale-invariant models, and to models trained with scale augmentation, provides an interesting way of looking at systematic errors. Using volumetric bias to evaluate generalizability and awareness of systematic errors will vastly improve confidence in how generalizable a given model is.
- Number of papers in your stack
5
- What is the ranking of this paper in your review stack?
3
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
This paper investigates volumetric bias in deep learning based image segmentation. The reviewers agreed on the relevance, importance and the quality of the work. Please take the comments into account in finalizing the manuscript.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
1
Author Feedback
We thank the reviewers for the detailed comments and want to clarify some concerns and open questions in this response. Overall, the reviewers agreed that the paper is written and organized very well. To strengthen the organization and objective of the paper, we chose to focus on one specific question: “How do biased resolutions during training affect the decisions of deep-learning-based (neuro-)segmentation, and, especially, can we transfer low-res annotations/images into high-res target applications?” Consequently, alternative SOTA implementations for the different tasks (hippocampus and cortex segmentation) were out of scope, and we focused on one whole-brain SOTA architecture (Henschel et al., 2020; Henschel and Kuegler et al., 2022). Our priority was to show that our findings generalize across multiple problems. We address the reviewers' questions and improvement suggestions in the following.
Value of the Classification Task Volumetric estimates are often chosen as independent variables of statistical models (e.g. Gabery et al., 2015; Potvin et al., 2016; Baglivo et al., 2018; Vinke et al., 2019; Lombardi et al., 2020). If such independent variables are biased due to the training scheme, this bias might propagate into the statistical model and influence study findings. The classification model chosen as a downstream task confirms that the training bias does in fact propagate into these downstream tasks, and indicates that this effect is reduced, or even undetectable, for VINN and CNN + scale augmentation.
Statistical Significance Testing Significance tests are unfortunately not applicable due to small dataset sizes (n=15 and n=10 per group in the two tasks) resulting from the limited availability of manual labels combined with small effects.
Effect on Adults Adults are not affected because the images reside at the original high-resolution, present during network training.
Early stopping We did use early stopping and clarified the point in the method (best epoch selected for each network based on the validation set).
Correlations between Dice, Hausdorff, etc. and Volume Bias Dice, Hausdorff, etc. and volume bias are independent; they do not correlate (see Figure 1). Standard metrics summarize the magnitude of the error, while bias is the expectation of the signed error. Magnitude-only metrics cannot differentiate between random and systematic errors (this is the defining difference between these metrics).
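The distinction can be illustrated with synthetic numbers (hypothetical values, not from the paper): two sets of signed volume errors with identical mean absolute error, one random and one systematic.

```python
import numpy as np

# Hypothetical signed volume errors (in % of the true volume) for two models.
random_err = np.array([-5.0, 5.0, -5.0, 5.0, -5.0, 5.0])   # scatters around zero
systematic_err = np.array([5.0, 5.0, 5.0, 5.0, 5.0, 5.0])  # consistent over-segmentation

# A magnitude-only summary (in the spirit of 1 - Dice) cannot tell them apart:
mae_random = np.abs(random_err).mean()          # 5.0
mae_systematic = np.abs(systematic_err).mean()  # 5.0

# The signed mean (the bias) exposes the systematic error:
bias_random = random_err.mean()          # 0.0
bias_systematic = systematic_err.mean()  # 5.0
```

A model can therefore score well on magnitude-based metrics while still carrying a directional group-level bias.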
Several other comments were taken into consideration to improve the presentation and clarity of the paper. Once again: thank you for the feedback!
Citations
Gabery et al., “Volumetric analysis of the hypothalamus in Huntington Disease using 3T MRI: the IMAGE-HD Study”, PLoS ONE, 10(2): e0117593, 2015
Potvin et al., for the Alzheimer's Disease Neuroimaging Initiative, “Normative data for subcortical regional volumes over the lifetime of the adult human brain”, NeuroImage, 137: 9-20, 2016
Baglivo et al., “Hippocampal Subfield Volumes in Patients With First-Episode Psychosis”, Schizophrenia Bulletin, 44(3): 552-559, 2018
Vinke et al., for the Alzheimer's Disease Neuroimaging Initiative, “Normative brain volumetry derived from different reference populations: impact on single-subject diagnostic assessment in dementia”, Neurobiology of Aging, 84: 9-16, 2019
Lombardi et al., “Structural magnetic resonance imaging for the early diagnosis of dementia due to Alzheimer's disease in people with mild cognitive impairment”, Cochrane Database of Systematic Reviews, 3(3): CD009628, 2020