Authors
Héctor Carrión, Narges Norouzi
Abstract
Skin diseases affect millions of people worldwide, across all ethnicities. Increasing diagnosis accessibility requires fair and accurate segmentation and classification of dermatology images. However, the scarcity of annotated medical images, especially for rare diseases and underrepresented skin tones, poses a challenge to the development of fair and accurate models. In this study, we introduce FEDD, a Fair, Efficient, and Diverse Diffusion-based framework for skin lesion segmentation and malignancy classification. FEDD leverages semantically meaningful feature embeddings learned through a denoising diffusion probabilistic backbone and processes them via linear probes to achieve state-of-the-art performance on Diverse Dermatology Images (DDI). We achieve an improvement in intersection over union of 0.18, 0.13, 0.06, and 0.07 while using only 5%, 10%, 15%, and 20% of labeled samples, respectively. Additionally, FEDD trained on 10% of DDI demonstrates a malignancy classification accuracy of 81%, 14% higher than the state of the art. We showcase high efficiency in data-constrained scenarios while providing fair performance for diverse skin tones and rare malignancy conditions. Our newly annotated DDI segmentation masks and training code can be found at https://github.com/hectorcarrion/fedd.
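To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the core idea: forward-diffuse an image to a chosen timestep, run one pass through a pretrained DDPM UNet, capture one decoder block's activations with a hook, and train linear probes on top (a 1x1 convolution for segmentation and a pooled linear layer for malignancy classification). This is an illustrative reconstruction, not the authors' released code; `ddpm_unet`, `decoder_block`, and the helper names are hypothetical.

```python
# Illustrative sketch only. Assumes a pretrained DDPM UNet `ddpm_unet` whose
# decoder blocks can be hooked; the block and timestep choices are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def q_sample(x0, t, alphas_cumprod):
    """Standard DDPM forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * torch.randn_like(x0)

def extract_features(ddpm_unet, decoder_block, x0, t, alphas_cumprod):
    """Noise the image to timestep t, run one denoising pass, and capture the
    activations of a single decoder block via a forward hook."""
    feats = {}
    hook = decoder_block.register_forward_hook(
        lambda module, inputs, output: feats.update(h=output)
    )
    with torch.no_grad():
        ddpm_unet(q_sample(x0, t, alphas_cumprod), t)  # denoised output is discarded
    hook.remove()
    return feats["h"]  # (B, C, h, w) semantic embedding

class SegProbe(nn.Module):
    """Per-pixel linear probe: upsample embeddings, then a 1x1 convolution."""
    def __init__(self, in_ch, n_classes=5):  # lesion, skin, marker, ruler, background
        super().__init__()
        self.linear = nn.Conv2d(in_ch, n_classes, kernel_size=1)

    def forward(self, h, out_size):
        h = F.interpolate(h, size=out_size, mode="bilinear", align_corners=False)
        return self.linear(h)  # per-pixel class logits

class ClsProbe(nn.Module):
    """Image-level linear probe: global-average-pool, then a linear layer."""
    def __init__(self, in_ch):
        super().__init__()
        self.linear = nn.Linear(in_ch, 1)  # malignant-vs-benign logit

    def forward(self, h):
        return self.linear(h.mean(dim=(2, 3)))
```

Which decoder block and which timestep to probe are treated as hyperparameters; the reviews below discuss how these were selected.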
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43990-2_26
SharedIt: https://rdcu.be/dnwLB
Link to the code repository
https://github.com/hectorcarrion/fedd
Link to the dataset(s)
https://ddi-dataset.github.io/
Reviews
Review #1
- Please describe the contribution of the paper
The authors present a denoising diffusion probabilistic model (DDPM)-based learning model that can segment dermatology images and classify them into malignant and benign classes. They show that such a model can be effectively trained when only a limited number of samples is available. They test their hypothesis on the DDI dataset by training models on three different skin tone classes with limited data.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The tackled problem is a very important issue in dermatology, where skin condition and disease types are very diverse. Moreover, samples are scarce for many conditions and diseases, so there is a strong need for models that can learn from a few instances. The application is novel in dermatology. The set of proposed experiments is well planned and executed, and the results are well presented. The paper is clearly written, the presentation is good, and it is easy to read and understand.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The main weakness of the paper is the lack of cross-validation in the results. It is not clear how the training/validation/test subsets are drawn. The DDI dataset is not balanced in terms of diagnosis; it is not clear why the authors balanced it during subselection. The DDI dataset contains image-wise diagnostic labels; why didn't the authors use the whole dataset for testing the classification pipeline?
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Overall, the paper is well written, and the methods are mainly well explained. Technical details are enough to replicate the work.
Data subsetting is the main issue in reproducibility. Since cross-validation results are not presented, it is impossible to understand if the results can be reproduced over another random selection of samples in the DDI dataset or if the results generalize to other images as expected.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The overall presentation is very good, and the paper is easy to follow. Thanks to the authors for the clear presentation.
The main point of concern is the subselection of the train/val/test data samples from the DDI dataset. The DDI dataset comprises ~650 images, and the authors only used 180 of those in the presented work. It needs to be made clear how the selection was made. What are the main criteria for inclusion? How do the authors ensure that the selected subsets reflect the diversity in the dataset?
The reviewer believes that cross-validation is necessary to understand better the results and how they can generalize to other data. The reviewer acknowledges that pixelwise labeling of all the DDI image samples can be time-consuming and tedious, but the authors could at least provide (1) a better explanation of how the data samples were selected for this study and (2) a cross-validation test among the labeled samples.
It also needs to be clarified why the authors selected a balanced dataset during subsetting. The DDI dataset is imbalanced, reflecting the reality in the clinical setting. Data balancing may make the results look unrealistically better than the clinical setting. Also, how do they ensure that rare malignancies are included during subsetting rather than only more common ones, such as BCC and SCC?
Does each image in the study dataset contain each of the five labels (e.g., ruler, marker)?
Last sentence of Section 3.1: Could the authors provide a more quantitative measure than “most promising”?
Color labels are needed in Figure 5.
In Figure 6, the reviewer expects the accuracy to go upward as the timesteps proceed. Also, in the caption, the authors mention that “later steps in the reverse diffusion process produce the highest quality embeddings.” If this is the case, why does the accuracy decrease? What is the measure of embedding quality?
On page 7, last sentence, the authors state, “As we increase the amount of data, the classifier has enough information to learn from the finer details of later blocks, boosting the performance.” This statement is vague and hypothetical, requiring more explanation. The reviewer thinks the model has more data to train on, so it learns more reliable semantic information and performs better.
FEDD stands for “fair, efficient, and diverse diffusion-based framework.” Can the authors elaborate more on why the model is fair and efficient compared to other methods? This information could (and should) have been presented more explicitly in the paper.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The topic of the paper is an important issue in dermatology. The paper is well written, and the experimental design is good. Even if the technical novelty is limited, the clinical novelty is good.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces a diffusion-based segmentation and classification method for skin lesion image analysis. The DDPM is leveraged with its output head upsampled for segmentation or downsampled for classification. Experimental results on a subset of a public dataset show better and more consistent classification accuracy and segmentation results.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The application of DDPM for skin lesion image analysis seems to be new and interesting.
- The authors have conducted comprehensive ablation studies to show the effectiveness of the proposed method.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The novelty of the proposed framework / pipeline seems to be quite limited. It is based on the standard DDPM with some minor modifications.
- The experimental materials seem to be limited; only a small number of samples appears to have been used for the analysis.
- The annotation process is not clear. Five classes were annotated, including lesion, skin, marker, ruler, and background. Were these annotations confirmed by clinicians?
- The comparison methods seem to be limited. Only off-the-shelf models, e.g., VGG and ResNet50, were used for comparison. This makes it difficult to understand the improvement over state-of-the-art methods, especially methods optimized for skin lesion image analysis.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Source code was not provided. The experiments were conducted on a subset of a public dataset; however, the selection process is unknown, and the annotated data are not open to the public.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- The authors claimed that none of the existing methods has explored segmentation and classification with DDPMs on dermatology images. It is true that the authors proposed the first such method; however, it is not clear that the proposed method was designed specifically for dermatology. The overall design seems to be a standard application.
- All experiments were conducted with labeled fractions ranging from 5% to 20%. It is not clear why this range was selected, or why it was cut off at 20%.
- Fig. 4 top and bottom rows seem to be the same?
- Several parts need more justification: (1) FEDD's efficiency: it seems that efficiency was not evaluated. (2) “FEDD outputs high-quality segmentations with less noise”: what does this noise refer to?
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
2
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The application of DDPM for skin lesion image analysis seems to be new and interesting. However, the technical novelties are limited. The experimental results with the limited number of samples may not be able to prove the effectiveness of the proposed method.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
The paper proposes a method for lesion segmentation and classification using features from the UNet decoder of a denoising diffusion probabilistic backbone pretrained on ImageNet. The method claims to achieve state-of-the-art performance on the Diverse Dermatology Images (DDI) dataset with just a subset of labeled samples.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well written in that it is concise and free of major grammatical errors. The application of a denoising diffusion probabilistic backbone for image segmentation is not entirely novel, but this work should be commended for applying the idea in a multitask learning setting, i.e., simultaneous lesion segmentation and classification.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The authors mention that the DDPM used was an ImageNet-pretrained model but do not clearly state whether any of the original model's layers/weights were fine-tuned apart from the segmentation and classification heads.
- The number of samples per skin tone used for training, validation, and testing seems quite low, and I am wary of the claims of such marked improvements in mIoU and classification accuracy reported in the paper. Unless the authors can share the code/implementation along with the dataset for reproducibility and verification, the length and level of detail of this short paper hardly convince me to accept the claims at face value.
- The test set totaled only 30 samples (10 per skin tone). I feel this is too small a sample from which to generalize performance to large databases or practical use in a clinical setting.
- The best-performing block and timestep in the proposed method seem to be heuristically determined; they may vary as the dataset size increases and could also vary from dataset to dataset.
- There is no mention of the computational requirements for training and running inference with this model.
- While the method seems to outperform other methods in small-dataset regimes, I would like to see how it compares if the entire dataset is used and the test set size is increased, say by 10-fold.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
No reproducibility data or code has been provided; thus, I am unable to verify the claims made by the authors.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- Provide the code and dataset for reproducibility, so the performance claims made in the paper can be verified.
- Show how the model performs as the dataset is increased, especially the test data (30 samples in total seems very low).
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Lack of reproducible code/data.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The paper under review proposes a denoising diffusion probabilistic model (DDPM) for dermatology image segmentation and classification into malignant and benign categories. The novelty lies in its demonstrated effectiveness when trained on limited data samples. The authors validate their approach on the Diverse Dermatology Images (DDI) dataset, specifically training their models on three distinct skin tone classes with limited data. The reviewers commend the authors for addressing a significant issue in dermatology, where diverse skin conditions and diseases often have scarce sample data. The application of DDPM in dermatology is considered novel.
A major concern raised is the subselection of training, validation, and test data samples from the DDI dataset. The authors used only 180 out of approximately 650 images in the DDI dataset. Reviewers seek clarification on how this selection was made, the criteria for inclusion, and how the selected subsets reflect the diversity of the dataset. Concerns are also raised about the small number of samples per skin tone used for training, validation, and testing, and about the resultant generalizability of the model's performance; one reviewer expresses skepticism about the reported improvements in mIoU and classification accuracy given the limited sample size.
Overall, while the paper's approach is appreciated for its novelty and for tackling a significant problem, the authors should address these concerns about data subselection, sample size, and generalizability in their rebuttal.
Author Feedback
We thank the reviewers for their constructive feedback. Below are responses to their comments:
“The annotation process is not clear” (R2) An experienced medical imaging researcher annotated the dataset; all masks underwent a secondary review. All annotations will be published. The annotation protocol is as follows:
1. The lesion is segmented following the boundary at which the skin transitions from a healthy to an unhealthy appearance; 2. markings or rulers are segmented; and 3. non-lesion skin is segmented.
The following criteria trigger a skip: the lesion is occluded, significantly blurry, partially visible, ambiguous (none or multiple lesions marked), or on the scalp (hair is not a labeled target). Examples can be found in DDI images 25, 55, and 161.
“The test samples used were a total of 30 samples. I feel this is too small a sample to generalize its performance” (R3) The skip criteria disqualify many DDI images, as they were not initially collected with computer vision in mind. This motivated our initial decision to limit our experimental data. However, we agree that this creates a small and potentially unrepresentative test set. We therefore expand our test set to leverage all annotations. This grew our test set 6.6-fold, from 30 to 198 images: 59 light, 80 medium, and 59 dark in skin tone.
“Data balancing may make the results look unrealistically better than the clinical setting” (R1) The expanded test set is skin-tone unbalanced, as expected in clinical settings.
We evaluated all previous baseline and FEDD checkpoints against the larger test set without re-training.
“Why didn’t the authors use the whole dataset for testing the classification pipeline?” (R1) For classification, we now test on the full DDI dataset.
The FEDD checkpoints trained on 5%, 10%, 15%, and 20% of DDI obtained mIoUs of 0.70, 0.75, 0.76, and 0.77 with a classification accuracy of 74%, 81%, 75%, and 80%, respectively. These represent improvements of 0.18, 0.13, 0.06, and 0.07 mIoU with 14%, 6%, 5%, and 4% accuracy over the next best method (EfficientNet).
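For reference, since the improvements above are reported in mIoU, the following is a minimal sketch of how mean intersection over union is typically computed for multi-class masks (illustrative; not the authors' evaluation code):

```python
import numpy as np

def mean_iou(pred, target, n_classes=5):
    """Mean IoU over classes, skipping classes absent from both label maps.

    pred, target: integer label maps of identical shape."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # only average over classes that actually occur
            ious.append(inter / union)
    return float(np.mean(ious))
```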
“Later steps in the reverse diffusion process produce the highest quality embeddings. If this is the case, why does the accuracy decrease?” (R1) We acknowledge that the wording is confusing; “later in reverse” should be rewritten simply as “earlier.”
“Can the authors elaborate more on why the model is Fair and Efficient compared to other methods?” (R1) Fig. 4 and Appendix Table 2 show superior and more consistent performance across different skin tones, supporting FEDD’s fairness for different skin tones. Figures 4, 5, and 7 show higher performance when trained on limited data, suggesting FEDD is label-efficient.
“FEDD outputs high-quality segmentations with less noise. What’s this noise referred to?” (R2) Here, we refer to fewer segmentation artifacts and false positives, shown in Fig 5.
“Only the off-the-shelf models, e.g., VGG and ResNet50, were used for comparison. It will be difficult to understand the improvement to the state-of-the-art methods” (R2) We compare against the current DDI state of the art for malignancy classification, DeepDerm, and FEDD shows better performance (Fig. 7). For segmentation, we compare against methods that can be similarly configured as UNets and are pre-trained on ImageNet. This work establishes the first diffusion-based benchmark for DDI segmentation and malignancy classification with fair performance across different skin tones. Future work will include additional SOTA semantic segmentation techniques in diverse dermatology settings presenting various skin conditions.
“There is no mention of the computational requirements for training and running inference using this model” (R3) Appendix Section 2 discusses software and hardware requirements as well as training and inference time.
“Lack of reproducible code/data” (R3) We will publish all code, data, and annotations upon publication.
The above clarifications will be incorporated into the final manuscript.
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The paper at hand introduces a methodology for lesion segmentation and classification that leverages features from the UNet decoder of a denoising diffusion probabilistic model. This model is pretrained on ImageNet, and the authors assert that it achieves state-of-the-art performance on the Diverse Dermatology Images (DDI) dataset, even with the use of only a subset of labeled samples.
This paper’s strengths reside in the innovative application of denoising diffusion probabilistic models (DDPM) to skin lesion image analysis, as well as the comprehensive ablation studies conducted to validate the effectiveness of the proposed methodology. Nonetheless, several weaknesses are apparent. Specifically, the novelty of the proposed framework seems modest as it is merely a minor modification of the conventional DDPM. Furthermore, the experimental materials appear to be somewhat limited, with a rather small sample size used for analysis. The clarity of the annotation process also comes into question, as it is not apparent whether it was confirmed by clinicians. Lastly, the use of off-the-shelf models for comparison complicates the understanding of improvements over state-of-the-art methods, particularly those tailored for skin lesion image analysis.
In their rebuttal, the authors adequately address the concerns raised. They clarify that the dataset was annotated by an experienced medical imaging researcher and subsequently underwent a secondary review. To address the issue of limited test samples, they broadened their test set by employing all available annotations, resulting in a 6.6-fold increase in size. Additionally, they confirmed that they tested the classification on the full Diverse Dermatology Dataset and provided relevant performance metrics. The authors also provide a satisfactory explanation for the observed performance discrepancies regarding the diffusion process and reassert the fairness and efficiency of their model relative to other methods. The concerns regarding the comparison models are addressed by noting that they have established the first diffusion-based benchmark for DDI segmentation and malignancy classification with fair performance across various skin tones. The authors further clarify that future work will incorporate additional state-of-the-art semantic segmentation techniques. Regarding computational requirements, they point out that these details are provided in an appendix. Lastly, they pledge to publish all code, data, and annotations upon the paper’s publication.
Taking into account the authors’ responses and their commitment to incorporating all raised concerns into the final manuscript, I am inclined to endorse the acceptance of this paper.
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The paper is interesting, well written, and tackles an important topic. However, I think that it requires significant changes after the rebuttal to be accepted. Specifically, the change of test samples from 30 to 198 might change the results presented in the original, peer-reviewed version of the paper.
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The authors’ response effectively addresses the primary concerns related to sample size, generalizability, and computational complexity. I am of the opinion that an efficient DDPM-based learning model, particularly when trained on limited data samples, carries substantial value. I recommend the paper’s acceptance. Should the paper be accepted, the authors should incorporate the discussions from the response into the final version of the manuscript.