
Authors

Jiarong Ye, Haomiao Ni, Peng Jin, Sharon X. Huang, Yuan Xue

Abstract

Deep learning based medical image recognition systems often require a substantial amount of training data with expert annotations, which can be expensive and time-consuming to obtain. Recently, synthetic augmentation techniques have been proposed to mitigate the issue by generating realistic images conditioned on class labels. However, the effectiveness of these methods heavily depends on the representation capability of the trained generative model, which cannot be guaranteed without sufficient labeled training data. To further reduce the dependency on annotated data, we propose a synthetic augmentation method called HistoDiffusion, which can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training. In particular, we train a latent diffusion model (LDM) on diverse unlabeled datasets to learn common features and generate realistic images without conditional inputs. Then, we fine-tune the model with classifier guidance in latent space on an unseen labeled dataset so that the model can synthesize images of specific categories. Additionally, we adopt a selective mechanism to only add synthetic samples with high confidence of matching to target labels. We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets. With HistoDiffusion augmentation, the classification accuracy of a backbone classifier is remarkably improved by 6.4% using a small set of the original labels. Our code is available at https://github.com/karenyyy/HistoDiffAug.
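The classifier-guided sampling in latent space described above can be sketched schematically. This is a toy illustration under stated assumptions, not the paper's implementation: the 2-D latent, the stand-in score function, and the per-class anchors are invented for illustration, whereas in HistoDiffusion the unconditional score comes from the pre-trained LDM and the guidance gradient from the fine-tuned latent classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(z, t):
    # Stand-in for the pre-trained unconditional LDM denoiser:
    # here it simply pulls latents toward the origin.
    return -z

def toy_classifier_grad(z, label):
    # Stand-in for the gradient of log p(label | z) from the latent
    # classifier: pulls latents toward a hypothetical per-class anchor.
    anchors = {0: np.array([2.0, 0.0]), 1: np.array([-2.0, 0.0])}
    return anchors[label] - z

def guided_sampling(label, steps=50, step_size=0.1, guidance_scale=1.0):
    """Schematic classifier-guided sampling: each update combines the
    unconditional score with the scaled gradient of the classifier's
    log-probability for the target label."""
    z = rng.standard_normal(2)
    for t in range(steps, 0, -1):
        score = toy_score(z, t) + guidance_scale * toy_classifier_grad(z, label)
        noise = rng.standard_normal(2) if t > 1 else 0.0
        z = z + step_size * score + np.sqrt(2 * step_size) * 0.1 * noise
    return z
```

With this toy setup, sampling for label 0 drifts toward the class-0 anchor; the same mechanism, applied with the real LDM score and latent classifier, is what lets the fine-tuned model synthesize images of specific categories.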

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_71

SharedIt: https://rdcu.be/dnwzD

Link to the code repository

https://github.com/karenyyy/HistoDiffAug

Link to the dataset(s)

https://github.com/AdalbertoCq/Pathology-GAN

https://warwick.ac.uk/fac/cross_fac/tia/data/pannuke

https://zenodo.org/record/5337009

https://zenodo.org/record/1214456


Reviews

Review #3

  • Please describe the contribution of the paper

    The authors present HistoDiffusion: an augmentation image synthesis pipeline consisting of 1) unconditional large-scale pre-training of a latent diffusion model (LDM) on unlabeled data, 2) conditional fine-tuning on an unseen labeled small set through a latent classifier, and 3) selection of highly-confident synthetic samples based on feature similarity with real data. The data employed are three histopathology sets (breast, pan-cancer) for the LDM pre-training (800K images), and a colorectal cancer set for fine-tuning (5K) and testing (10K). The generated images are compared with a selective cGAN augmentation method (StyleGAN2-based), showing better FID scores, while the presented classification metrics are also improved for all the tested real+augmentation schemes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper deals with augmentation image synthesis for low-scale data classification, leveraging large-scale (compiled) unsupervised datasets. The topic is clinically relevant, especially given the potential to use the same pre-trained model to fine-tune several small-set supervised classifiers.
    2. The paper is well structured and written, with most aspects clearly described.
    3. The authors combine current generative-modeling advances (LDM, classifier-guided diffusion sampling, selective augmentation) with a small-set fine-tuning process (which includes an auxiliary classifier when fine-tuning the decoder) to formulate a new synthesis pipeline that yields better results than the cGAN method they compare with.
    4. Quantitative and qualitative results support the case.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Table 1 in the supplementary shows the clear benefit of the small-scale fine-tuning process on synthetic image quality, and the smaller improvement from the large-scale pre-training step; this implies that greater data diversity, rather than sample size alone, may be needed. It would add value to the analysis to provide similarity metrics between the large- and small-scale datasets, as well as results for fine-tuning on another small set (using the same pre-trained model).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible. The code and pre-trained model release would be appreciated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Figure 2 could be info-graphically improved to include more connections with the HistoDiffusion description in 2.2 (e.g., the D’ can be noted), and the selective augmentation process - for a clear overview of the whole proposed pipeline.
    2. It would be good to include the details of the datasets, statistics, splits and the 5% selection of the small scale set - to overview potential class imbalances, get informed if patient-level splits are preserved, etc. (these could be added to the supp).
    3. Regarding the multi-class classification metrics, please state whether they are micro-, macro-, or weighted-averaged. It would be of interest to see the per-class scores (or a confusion matrix) to assess whether there are clear benefits for specific classes (especially in cases of imbalance).
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an important topic, with clinical applicability potential, leveraging current generative modeling advances and large-scale unlabeled data. The analysis and experimental results suggest that the proposed synthetic image augmentation pipeline outperforms the previous work when fine-tuning a low-data supervised classifier.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel synthetic augmentation model for histopathology image classification based on Latent Diffusion. Their method is more effective than synthetic augmentation based on StyleGAN at generating a large number of realistic images for training a downstream image classifier.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper lies in the application of Latent Diffusion models (LDM) to the task of conditional image generation for histopathology. The authors combine the generative power of LDM with selective filtering of resulting images for a high quality training set of up to 3x the size.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The technique offers only marginal improvement in classification scores over the previous SotA despite significant implementation complexity (multiple networks and training steps required). Also, a potentially powerful and simple approach to training was not tried: pre-train with 95k NCT-CRC-HE images, ignoring labels, then fine-tune the conditional decoder with the 5k remaining labeled images.

    Further, the complexity of adopting synthetic augmentation techniques in general should be weighed against simply annotating additional samples. It would be interesting to measure the performance of a 300% model trained on exclusively real data as an upper bound.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper describes the model architecture and references the appropriate implementation details. Still, the paper in general is missing an analysis of sensitivity to hyper-parameters or variance with respect to stochastic inputs (weight initialization, training data selection, stochastic gradient descent, etc.), resulting in scores without any uncertainty.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This is an interesting approach to conditional synthetic data generation which I believe will eventually lead to a significant improvement in the quality of downstream classifiers. However, I am nervous that the complexity of the approach outweighs the marginal gains over cGAN or simply annotating additional images. That said, the method may actually allow more stable and consistent training than StyleGAN-based approaches. That could be quantified and would be an interesting result!

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting overall model architecture and training scheme which should be more widely known to the MICCAI community.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The article proposes a synthetic augmentation method called HistoDiffusion for deep learning based medical image recognition systems. The proposed method trains a latent diffusion model on unlabeled datasets and further fine-tunes it on a small labeled dataset with classifier guidance. The proposed method significantly improves the classification accuracy of a backbone classifier by 6.4% using a small set of the original labels, making it useful in scenarios where labeled training data is difficult to obtain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • very well organized and easy to follow
    • the approach is well designed in two stages, leveraging a massive amount of unlabeled data before fine-tuning on a small amount of labeled data
    • results look promising, comparing the value of augmentation between the proposed approach and a state-of-the-art method
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • lack of ablative study for each stage of the proposed method
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors are committed to providing source code; reproducibility should be fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be good to address the following two points: 1) What is the major contribution of the proposed method over ref [25]? 2) Which component of the proposed method turns out to be most effective for data augmentation: the baseline LDM, the few-label fine-tuning, or the class-conditioned synthesis? This would help readers understand which step of training is most critical.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall very well written paper, algorithm novelty is sufficient, a few more ablative studies would make it better.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper introduces a synthetic augmentation method based on Latent Diffusion for histopathology image classification.

    All reviewers agreed on the novelty of contributions with state-of-the-art generative modeling, strength of experimental demonstration, and high relevance to the MICCAI community.

    Authors should address reviewer feedback in the camera ready. In particular, we suggest elaborating on the contributions over related work; including some discussion of which of the proposed components most enhance efficacy; and outlining potential ways to reduce the complexity of the approach.




Author Feedback

We are appreciative of the insightful feedback offered by all reviewers and the meta-reviewer. We are also encouraged that reviewers found the topic of our paper important to the MICCAI community, our approach clinically applicable, and our results convincing. We will address reviewer feedback in the camera ready, and we will release the source code, pre-trained model, and our final model after acceptance.

Our proposed generative model, while it builds upon [25], presents key innovations. The most important contribution is the carefully designed pre-training process using large-scale unlabeled data and the unique fine-tuning process using small-scale labeled data, which give our approach great potential for clinical applications. In addition, our proposed framework, especially with the incorporation of the proposed fine-tuning with auxiliary classification loss, has demonstrated superior classification performance when compared with previous state-of-the-art methods, as evidenced in our extensive evaluations. Importantly, our proposed large-scale pre-training does not necessitate any annotated data and is not feasible through previously established models such as StyleGAN.

We wish to highlight that despite the multi-component nature of our proposed framework, all components are indispensable for creating a versatile, pre-trained foundation model. The reason we did not include component-wise or stage-wise ablation is that the approach is not viable with any component or stage unplugged or replaced. In our experiments, we ensured that no overlapping datasets were present in pre-training and fine-tuning, enabling our proposed method to work for fine-tuning with unseen subjects in a real-world clinical setting. Users of this model would be exempt from the necessity to re-train the diffusion model and would only be tasked with implementing targeted, small-scale fine-tuning based on their specific downstream recognition tasks. While the pre-training itself is complicated, our approach spares users this most intricate and time-intensive step, rendering fine-tuning considerably more efficient in practice. We contend that this approach is a far more practical and resource-conscious alternative to annotating additional training images.
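The selective augmentation step discussed in the reviews (keeping only synthetic samples with high confidence of matching their target labels) can be illustrated with a minimal sketch. The cosine-similarity-to-class-centroid criterion and the fixed threshold below are hypothetical stand-ins, not the paper's exact selection rule:

```python
import numpy as np

def select_confident(synth_feats, synth_labels, real_feats, real_labels,
                     threshold=0.8):
    """Return indices of synthetic samples whose feature vector has high
    cosine similarity to the mean (centroid) of real features of the same
    class. The centroid + fixed-threshold rule is an illustrative
    assumption, not the paper's exact criterion."""
    # Per-class centroids of the real feature vectors.
    centroids = {
        y: real_feats[real_labels == y].mean(axis=0)
        for y in np.unique(real_labels)
    }
    keep = []
    for i, (f, y) in enumerate(zip(synth_feats, synth_labels)):
        c = centroids[y]
        cos = f @ c / (np.linalg.norm(f) * np.linalg.norm(c) + 1e-12)
        if cos >= threshold:
            keep.append(i)
    return keep
```

A filter of this kind discards synthetic samples whose features resemble a different class more than the one they were conditioned on, which is the rationale behind only augmenting with high-confidence samples.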

We thank the reviewers for providing constructive feedback about changes to improve clarity. We are committed to addressing such feedback; we will include an info-graphically improved Fig. 2, and will ensure that additional details pertaining to hyperparameters, variance w.r.t. stochastic inputs, multi-class classification metrics, datasets, splits, and selection of the small-scale set are added in the camera-ready version.

