
Authors

Matthew MacPherson, Keerthini Muthuswamy, Ashik Amlani, Charles Hutchinson, Vicky Goh, Giovanni Montana

Abstract

Understanding the internal physiological changes accompanying the aging process is an important aspect of medical image interpretation, with the expected changes acting as a baseline when reporting abnormal findings. Deep learning has recently been demonstrated to allow the accurate estimation of patient age from chest X-rays, and shows potential as a health indicator and mortality predictor. In this paper we present a novel comparative study of the relative performance of radiologists versus state-of-the-art deep learning models on two tasks: (a) patient age estimation from a single chest X-ray, and (b) ranking of two time-separated images of the same patient by age. We train our models with a heterogeneous database of 1.8M chest X-rays with ground truth patient ages and investigate the limitations on model accuracy imposed by limited training data and image resolution, and demonstrate generalisation performance on public data. To explore the large performance gap between the models and humans on these age-prediction tasks compared with other radiological reporting tasks seen in the literature, we incorporate our age prediction model into a conditional Generative Adversarial Network (cGAN) allowing visualisation of the semantic features identified by the prediction model as significant to age prediction, comparing the identified features with those relied on by clinicians.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_25

SharedIt: https://rdcu.be/cVRU5

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work the authors present a study comparing the performance of three radiologists against data-driven models on chest X-ray age prediction and ranking tasks, with the models trained on a highly heterogeneous, non-curated set of chest X-rays drawn from a variety of clinical settings across six hospitals. The authors conclude that (a) the radiologists are significantly more accurate at detecting age-related changes in a single patient than at estimating age from single images, and (b) the models significantly outperform humans on both tasks. The authors' work indicates that accuracy gains from larger datasets are likely to be small, and that most age-relevant information in this modality is preserved at modest image resolution.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The authors demonstrate a GAN-based 'explainable AI' solution to visualise the age-relevant features identified by the model, comparing them with those identified by the radiologists based on their clinical experience. The paper is largely experimental, demonstrating the effectiveness of each component of the proposed method.
    2. The authors use GAN-generated synthetic chest X-rays conditioned on patient age to intuitively visualise age-relevant features via simulated age progression.
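
    The paper's implementation is not reproduced in this review; below is a minimal, hedged sketch of the kind of age-progression visualisation described above, assuming a hypothetical pretrained age-conditioned generator `generator(latent, age)` returning a single-channel chest X-ray tensor (the function name, age normalisation and output shape are assumptions, not the authors' published code).

    ```python
    # Minimal sketch of age-progression visualisation with an age-conditioned
    # generator. `generator` is a hypothetical pretrained cGAN generator taking a
    # latent code and a normalised age; this is NOT the authors' implementation.
    import torch

    @torch.no_grad()
    def age_difference_map(generator, latent, age_young=30.0, age_old=70.0, age_scale=100.0):
        """Generate the same synthetic subject at two ages; return both images and their pixel-wise difference."""
        img_young = generator(latent, torch.tensor([[age_young / age_scale]]))
        img_old = generator(latent, torch.tensor([[age_old / age_scale]]))
        return img_young, img_old, (img_old - img_young).squeeze()

    # Usage (assuming the generator outputs tensors of shape [1, 1, H, W]):
    #   import matplotlib.pyplot as plt
    #   latent = torch.randn(1, 512)
    #   young, old, diff = age_difference_map(generator, latent)
    #   plt.imshow(diff.cpu(), cmap="seismic"); plt.colorbar(); plt.show()
    ```

    Regions with large positive or negative differences would then highlight the anatomy the prediction model treats as age-relevant, which is the kind of map that can be compared against the features reported by the radiologists.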

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. In Section 2.2, the paper says "whether the ability of radiologists to order two images of the same patient by age is superior to their ability to estimate true patient age from a single image". However, the experiment comparing these two abilities is not introduced, and the final conclusion is unclear.
    2. In Figure 1(b), it is not clear how the model's pixel difference map is implemented.
    3. The purpose of separating aging features is insufficiently presented and unclear.
    4. The expected ranking success rate for humans is given as 59.7%±2.8% in Section 3.2; what is the basis of this expected success rate? (One possible reading is sketched after this review item.)
    5. In the model architecture comparison, the MAE of 3.33 from Efficient+LR is only about a 1% improvement over Efficient+CL and Efficient+OR in Table 1. Such a minor improvement is not sufficient to show that the Efficient+LR network is optimal. Which method did you ultimately choose for age prediction?

    Minor: 1. Figure 3 (right) and its description under "Generalisation Performance on Public Datasets" in Section 3.1 are ambiguous: does the 2.98 years refer to results on your own dataset, or to results after fine-tuning on the Chest14 training data?
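
    Regarding the 59.7% baseline queried in point 4 above: the paper's own derivation is not reproduced here. One plausible reading, stated only as an assumption, is that if single-image age estimates carry independent, zero-mean Gaussian errors with standard deviation sigma, then two images of the same patient taken Delta years apart are ordered correctly exactly when the difference of the two errors is below Delta, giving P(correct) = Phi(Delta / (sigma * sqrt(2))); averaging over the study's distribution of time gaps would then yield an expected success rate. A small illustration (the sigma and gap values are placeholders, not figures from the paper):

    ```python
    # Illustrative only: expected pairwise ranking accuracy implied by a given
    # single-image error level, under an assumed iid Gaussian error model.
    import numpy as np
    from scipy.stats import norm

    def expected_ranking_accuracy(sigma_years, gaps_years):
        """P(correct ordering) per pair: Phi(gap / (sigma * sqrt(2)))."""
        gaps = np.asarray(gaps_years, dtype=float)
        return norm.cdf(gaps / (sigma_years * np.sqrt(2)))

    # Example with placeholder values: sigma = 5 years, time gaps of 1-10 years.
    print(expected_ranking_accuracy(5.0, np.arange(1, 11)).mean())
    ```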

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results may be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is well motivated and the results are convincing, except for some issues. Currently, I recommend acceptance and look forward to the replies in the rebuttal.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present a framework aimed at determining patient age from a chest X-ray. The model was trained on a large database (1.8M chest X-rays). An ablation study investigating model accuracy as a function of training set size and image resolution demonstrates the generalisability of the approach. The approach is based on a conditional Generative Adversarial Network and allows visualisation of the predicted scan, which can be used to identify semantic features learned by the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    An impressive large dataset is used in this work.

    I found the ablation study and the comparison with radiologists (used to validate the proposed work) quite interesting.

    The use of the predictive model to visualize semantic features is quite relevant and this solution can be used to discover new potential biomarkers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper should be improved in clarity and has different grammatical errors/typos.

    Some of the results are not described adequately and the results section must be reorganized/rewritten.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Details required to train the network are missing. This includes some important hyperparameters such as the learning rate, batch size, etc.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Some of the results are not clear and not well described. Below are my main concerns:

    1) Please define some of the metrics used during evaluation (i.e., MAE, ME, and R²). (Generic definitions are sketched after this list of comments.)

    2) Some of the results mentioned in the text are missing from the tables. For example: a) "…CNN is saturated with this size of dataset and the specific modeling approach is only marginally relevant. Using DenseNet-169 CNN (14.1M parameters) in place of EfficientNet-B3 (12M parameters) led to a slightly lower performance (3.40 vs 3.33)…" b) "…Taking a mean ensemble estimate from models trained at the four resolution levels reduces MAE to 2.78 years, which to the best of our knowledge is the best accuracy reported in a heterogeneous non-curated dataset…"

    3) Some sentences in the results are very confusing. For example, what does the following sentence mean? Please rewrite it: "We observe actual success rates of 67.1% for the humans and 82.5% and 85.5% for the regression and ranking models; with respective p-values of 0.001, 0.201 and 0.029 we find that the radiologists significantly exceed the baseline expectations, no evidence that the regression model outperforms (as expected) and weak significance for the ranking model exceeding the baseline regression expectation."

    Why are the "model incorrect" results not provided in Fig. 4?
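
    For reference against point 1) above, the queried metrics (MAE, ME and R²) have standard definitions; the sketch below gives generic implementations and is not the authors' evaluation code.

    ```python
    # Generic definitions of the queried evaluation metrics (not the authors' code):
    # mean absolute error, mean (signed) error, and the coefficient of determination.
    import numpy as np

    def mae(y_true, y_pred):
        return np.mean(np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)))

    def mean_error(y_true, y_pred):
        # Signed bias: positive values mean ages are over-estimated on average.
        return np.mean(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))

    def r_squared(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        return 1.0 - ss_res / ss_tot
    ```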

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper uses an impressive large dataset and some of the presented results are interesting. However, the clarity of the paper must be improved and some of the results are not reported correctly in figures and tables.

    The description of the framework should also be extended so that relevant hyperparameters required for training the system are correctly described.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    In this study the authors present a chest X-ray age prediction model trained on a large heterogeneous set of chest X-rays, with a sensitivity analysis with respect to training set size and image resolution, and generalisation performance on the public NIH ChestX-ray14 dataset. Moreover, they present a study comparing the performance of human radiologists against their model on two tasks: (a) ground truth age prediction from a chest X-ray; (b) ranking two time-separated scans of the same patient in age order.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The study is very interesting, well written and well organised. The authors deliver on and answer their hypothesis questions. The network design, combining a GAN with regression models to predict the age of a patient from an X-ray, is a novel idea.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My only concern is in the human AI comparison.

    The authors used a test cohort for this which may include images acquired under the protocols of the six hospitals that the AI network already knows from training. This is not a fair comparison with human experts. The authors should compare on a cohort from hospitals whose protocols, modalities, resolution and image quality the AI tool has never seen before; this would make the comparison with the human experts fair.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    easily reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Very nice study and idea, well designed and delivered.

    My only concern is in the human AI comparison.

    The authors used a test cohort for this which may include images acquired under the protocols of the six hospitals that the AI network already knows from training. This is not a fair comparison with human experts. The authors should compare on a cohort from hospitals whose protocols, modalities, resolution and image quality the AI tool has never seen before; this would make the comparison with the human experts fair.

    Please kindly verify whether any images in the test cohort from the six hospitals already existed in the AI training cohort. If so, please remove these images, test again on the new test cohort, and report the differences relative to the experts. Thank you.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study is very interesting, well written and well organised. The authors deliver on and answer their hypothesis questions. The network design, combining a GAN with regression models to predict the age of a patient from an X-ray, is a novel idea.

    My only concern is in the human AI comparison.

    The authors used a test cohort for this which may include images acquired under the protocols of the six hospitals that the AI network already knows from training. This is not a fair comparison with human experts. The authors should compare on a cohort from hospitals whose protocols, modalities, resolution and image quality the AI tool has never seen before; this would make the comparison with the human experts fair.

    Please kindly answer the above question.

  • Number of papers in your stack

    8

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Summary & Contribution: This work presents a comparative study of age prediction models for chest X-rays. The authors compare radiologists with deep learning models on two tasks: age estimation from a single image and ranking of two time-separated images. The main motivation of this work is that understanding aging features is important for accurate and informative diagnosis. The authors trained the models on 1.8M chest X-rays and compared the results with those from clinicians, including the image features that both (model and clinicians) used to make a prediction. The main conclusion is that the models significantly outperform humans on both tasks.

    The main contribution of this work is the proposed model for age prediction with good generalisation on a public dataset, and the use of a GAN to visualise age-relevant features.

    Key strengths:

    • The GAN-based explainable-AI solution and the comparison with radiologists are sound
    • Large dataset used in the experiments
    • Strong ablation and comparison study

    Key weaknesses:

    • Ideally, a new test dataset from another hospital should have been used to compare the AI models and radiologists.
    • Clarity of the paper could be improved, including grammatical errors and typos.

    Evaluation & Justification: Reviewers agree that this is a strong and interesting study comparing the performance of AI with radiologists using an impressive dataset. The idea of using a GAN to visualise age-relevant features and compare them with the features used by radiologists is considered sound.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

Dear Reviewers & Chairs,

We are grateful to the reviewers for their time and for the thoughtful comments provided, and we will aim to address their concerns within the revision period. Specific points to address from the reviews are:

Clarity of presentation: Reviewer 1 comments that the GAN-based feature identification is insufficiently motivated and explained, and Reviewers 1 & 3 found the analysis of the results from the 'age ranking' task to be poorly explained. We will endeavour to improve the presentation of these sections within the available space constraints. We apologise for the grammatical errors and lack of clarity reported, and will address the specific examples highlighted (e.g., Reviewer 3, section 6) and perform a thorough repeat proofreading.

Additional training information: Reviewer 3 requests additional information on the model hyperparameters used, for reproducibility. This information was omitted from the paper draft given the space constraints, but we will include it within the allowable extension.

Use of external datasets: Reviewer 4 expresses concern that images from the same hospitals used to train the models were included in the test data for the human/AI comparison study, potentially flattering the model performance (due to commonalities in imaging protocols, etc.) relative to an external dataset. In the age prediction model results we demonstrated generalisation performance on an external dataset (Chest14), giving us confidence that model performance would be similar on data from new hospitals. We also consider the use of images from our dataset to be justified in this case, since the patient population represented is large and heterogeneous (in terms of patient demographics, scanner types, clinical settings and image acquisition settings), and our image pre-processing pipeline, applied directly to raw DICOM, eliminates variations in image storage protocols, giving us confidence in our conclusions. Furthermore, the radiologists involved in the study were recruited from those same hospitals and hence are also familiar with any potential hospital-specific factors, giving a fairer comparison. Given the generalisation results provided and the strength of the model's outperformance of the radiologists on the study tasks, we do not believe that our conclusions would be significantly affected by using a completely different dataset, although a performance gap due to image quality variation is possible in theory. We appreciate this point, however, and a larger study including an external dataset would provide further support in follow-up work.


