
Authors

DongAo Ma, Jiaxuan Pang, Michael B. Gotway, Jianming Liang

Abstract

Deep learning nowadays offers expert-level and sometimes even super-expert-level performance, but achieving such performance demands massive annotated data for training (e.g., Google’s proprietary CXR Foundation Model (CXR-FM) was trained on 821,544 labeled and mostly private chest X-rays (CXRs)). Numerous datasets are publicly available in medical imaging but individually small and heterogeneous in expert labels. We envision a powerful and robust foundation model that can be trained by aggregating numerous small public datasets. To realize this vision, we have developed Ark, a framework that accrues and reuses knowledge from heterogeneous expert annotations in various datasets. As a proof of concept, we have trained two Ark models on 335,484 and 704,363 CXRs, respectively, by merging several datasets including ChestX-ray14, CheXpert, MIMIC-II, and VinDr-CXR, evaluated them on a wide range of imaging tasks covering both classification and segmentation via fine-tuning, linear-probing, and gender-bias analysis, and demonstrated our Ark’s superior and robust performance over the state-of-the-art (SOTA) fully/self-supervised baselines and Google’s proprietary CXR-FM. This enhanced performance is attributed to our simple yet powerful observation that aggregating numerous public datasets diversifies patient populations and accrues knowledge from diverse experts, yielding unprecedented performance yet saving annotation cost. With all codes and pretrained models released at GitHub.com/JLiangLab/Ark, we hope that Ark exerts an important impact on open science, as accruing and reusing knowledge from expert annotations in public datasets can potentially surpass the performance of proprietary models trained on unusually large data, inspiring many more researchers worldwide to share codes and datasets to build open foundation models, accelerate open science, and democratize deep learning for medical imaging.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_62

SharedIt: https://rdcu.be/dnwdJ

Link to the code repository

GitHub.com/JLiangLab/Ark

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    ARK addresses the primary problem affecting ML in medical imaging - the lack of good data. It is a teacher-student network with multi-task heads, which allows it to learn from heterogeneous small datasets with differing expert labels, then transfer to the target task. They demonstrate the approach on chest X-rays, obtaining state-of-the-art results, and conduct extensive analysis into how their approach affects generalisation and reduces bias.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approach is a novel solution to one of the largest problems in medical imaging ML. The paper is clear and well written, presents a convincing experiment, and provides a detailed analysis of the effect of the approach on generalisation and bias. The authors have taken a number of different approaches and combined them in an elegant way.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper overwhelmingly presents the approach’s successes. There are undoubtedly failure cases. It would be interesting to present some of these and to examine them in more detail. This would increase confidence in the approach and provide a good example to others.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper seems to be reproducible according to the checklist.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I enjoyed reading your paper. My only feedback is related to the weakness listed above. I think it’s important that as a field we are open about any failures. It’s possible all of your experiments were a success, but if there were any failures it would be great for others to include them too.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach to a significant problem in the field, shows state-of-the-art results, and provides good analysis of effects on generalisability and bias.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The paper hypothesizes that a model trained on multiple small datasets is more robust and hence proposes a student-teacher framework that reuses knowledge by learning from multiple heterogeneous datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses a very key and practical problem i.e., lack of clinical data.
    • The authors have compared their proposed method with different models as well as different downstream tasks to demonstrate the robustness and power of the representations learned by their method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors list data privacy as the only difference between the federated learning (FL) framework and their proposed method. However, this is a rather weak argument regarding the core of their method. It’s true that privacy is one of the major concerns of FL, but ultimately an FL model learns from multiple different sources, which is similar to what the authors propose.
    • The proposed method claims to train a multi-task model in order to learn from different datasets. However, the datasets seem to have a lot in common. This raises the question of whether the different datasets are essentially multiple augmentations of the same input with rather moderate noise in the targets, owing to label variability. The authors need to investigate the level of heterogeneity with respect to performance.
    • The paper does not provide sufficient detail regarding training the models. From Algorithm 1, it seems that the classifiers are binary. If that’s the case, it means ARK is trying to do multi-class classification with an ensemble of binary classifiers.
    • The paper states that ARK has outperformed SOTA. However, I wonder whether or not they augmented the training data for the existing methods.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper states that the code and the models will be publicly available after publication. Moreover, the datasets are public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would recommend the authors conduct a fair comparative study. This includes applying the same augmentation rounds to the other methods. I would also like to see some discussion regarding the use of an ensemble of binary classifiers, which is a well-studied area in the literature. As the authors mention in the future-work section, I think they should include multiple different modalities in their datasets. Finally, I would like to see a discussion of the level of heterogeneity and its correlation with performance.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Limited technical novelty; lack of a fair comparison; no arguments regarding the level of heterogeneity vs. model performance.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This work proposes a scheme named Ark which utilizes several public datasets to develop a deep learning model with multi-task heads. Using a student-teacher model, Ark can learn heterogeneous labels from different datasets. The authors validate the performance of Ark on 5 classification tasks and 5 segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Novelty: Compared to previous studies that analyze different datasets, this work uses a semi-supervised learning scheme to extract knowledge prototypes from different public datasets.

    2. Thorough comparison: During testing, this study evaluates models not only on seen datasets but also on one unseen dataset, which demonstrates the generalization of the approach.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In Result 4, Fig. 3 shows the comparison between Ark and CXR-FM on two datasets. I wonder about the results on the other four datasets, and whether there is a way to include them together with these two. Otherwise, it might confuse readers.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code and data are available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, this is a well-designed work that develops the Ark framework for extracting imaging knowledge from multiple datasets. This work certainly has the potential to address the lack of annotated data in the medical field.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major strengths include the ability to deal with data scarcity in the medical field and the potential of the Ark framework to learn knowledge from different domains.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes a method named Ark which utilizes several public datasets to develop a deep learning model with a multi-task head. Using a student-teacher model, the method can learn heterogeneous labels from different datasets. The paper demonstrates the performance of the method on five classification tasks and five segmentation tasks. The paper provides extensive results and is generally of very high quality.

    Please clarify the points raised by reviewers in the camera ready version:

    (1) The paper needs to discuss and examine failure cases. (2) The paper needs to clarify, or at least discuss, the level of heterogeneity w.r.t. performance. (3) The absence of certain crucial details regarding the comparison raises concerns about the reliability of the results and the fairness of the comparison. For instance, there is no explicit mention of the augmentation techniques employed, the specific hyper-parameter search procedures, or other relevant performance information concerning the existing methods. Please provide sufficient details either in the main text or in the appendix.

    Thanks for the high quality work!




Author Feedback

We deeply appreciate your insightful comments and constructive criticisms and are thrilled by the early acceptance, for which our responses are optional. Nevertheless, we are endeavoring to submit a significantly improved camera-ready version, addressing all your critiques, adding a table for training details, beautifying the figures, and improving the text for a more accurate, comprehensible, and compact presentation.

R1: (a) Ark and FL are orthogonal in objectives: Quoted from Gao et al. (arXiv:2210.04505), “[s]ince it was proposed, FL is meant to study the case where multiple clients holding homogeneous data train a single global model collaboratively under specific privacy constraints.” By contrast, Ark accrues and reuses knowledge retained in ‘heterogeneous’ expert annotations across numerous ‘public’ datasets (no privacy concerns) for ‘centralized’ pretraining of generic source models transferable to application-specific target tasks. Motivated by your question, we have dived into the literature and discovered that our “multi-task heads via cyclic pretraining” has the potential to overcome the homogeneity limitation of conventional FL; we will conduct comprehensive experiments to confirm this hypothesis, enabling ‘heterogeneous’ annotations across ‘private’ clients in FL and leading to a novel contribution to heterogeneous FL (Gao et al.).

(b) Heterogeneity level w.r.t. performance: Ark-5/6 was trained with 335,484/704,363 chest X-rays from the first 5/6 datasets in Tab. 1, collected by 5/6 different institutions around the world and annotated by their experts. We used the originally provided labels (Tab. 3), which show marked differences across institutions. Ark automatically handles this inherent level of label heterogeneity. We did not understand how you wanted us to vary heterogeneity levels in expert labels. If you could clarify your comments via the meta-reviewer, we would be happy to address them in depth.

(c) Details: We are sorry that not all training details could be included due to the page limit, but we will include a new table in the appendix by removing those section titles. Moreover, all code and pretrained models will be released; therefore, all details will be available on GitHub for reproducibility. Ark has a “task head” for each dataset to handle its labels as originally provided (Tab. 3), covering binary, multi-class, and multi-label classification. Ark’s “heads” share the same “body”, and the training aims to make the “body” superior and robust in performance and transferable to other tasks; in an ensemble of (binary) classifiers, the classifiers share nothing in architecture. As a result, Ark is not equivalent to an ensemble of binary classifiers.

(d) Augmentation: For fair comparisons, we follow the SoTA setup at GitHub.com/JLiangLab/BenchmarkTransformers and apply the same augmentations for all methods.
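For illustration only, here is a minimal PyTorch-style sketch (our own simplification, not the released Ark code; all names and dimensions are hypothetical) of the design described in (c): a single shared encoder “body” with one task “head” per pretraining dataset, so each dataset keeps its originally provided label space.

    import torch
    import torch.nn as nn

    class MultiHeadModel(nn.Module):
        # Shared encoder ("body") with one classification head per dataset.
        def __init__(self, encoder, feat_dim, num_classes_per_task):
            super().__init__()
            self.encoder = encoder  # shared backbone, e.g., a CNN or ViT
            # one linear head per pretraining dataset/task
            self.heads = nn.ModuleList(
                [nn.Linear(feat_dim, n) for n in num_classes_per_task]
            )

        def forward(self, x, task_id):
            features = self.encoder(x)              # (batch, feat_dim)
            return self.heads[task_id](features)    # logits in that dataset's label space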

R3: Failures: Before discovering “cyclic pretraining”, we experienced significant challenges in training Ark via “concurrent pretraining” (mentioned in Sec.2), where a mini-batch is formed by randomly sampling an equal number of images from each dataset, and the loss for each image is computed based on its associated dataset id and labels. The idea is intuitive, but the model hardly converges; we suspect that the loss summation over all task heads simultaneously weakens gradients for back-propagation, causing confusion in weight updating. Once we invented cyclic pretraining, the training became stable. We will detail our experience.
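To make the contrast with “concurrent pretraining” concrete, below is a hedged sketch (our own illustration reusing the hypothetical names from the sketch above, not the released training script) of one epoch of cyclic pretraining: the datasets are visited one at a time, so each mini-batch updates the shared body through exactly one task head rather than summing losses over all heads at once.

    # Hypothetical sketch of one epoch of "cyclic" pretraining.
    # loaders[t] is a DataLoader for dataset t; criteria[t] is the matching loss
    # (e.g., BCEWithLogitsLoss for multi-label, CrossEntropyLoss for multi-class).
    def pretrain_one_epoch(model, loaders, criteria, optimizer, device="cuda"):
        model.train()
        for task_id, loader in enumerate(loaders):      # cycle through the datasets
            for images, targets in loader:
                images, targets = images.to(device), targets.to(device)
                logits = model(images, task_id)         # only this dataset's head is used
                loss = criteria[task_id](logits, targets)
                optimizer.zero_grad()
                loss.backward()                         # gradients flow through one head only
                optimizer.step()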

R4: Gender bias: Evaluating gender bias robustness requires the dataset to have gender information, but datasets 3, 4, and 10 don’t come with patient genders. We will clarify this to avoid confusion.

Meta-reviewer: Many thanks again for the early acceptance. We have taken into account your three comments and provided detailed responses in our replies to R3, R1(b), and R1(c,d). We would be happy to address any additional critiques that you and the reviewers may have.


