
Authors

Siyuan Yan, Chi Liu, Zhen Yu, Lie Ju, Dwarikanath Mahapatra, Victoria Mar, Monika Janda, Peter Soyer, Zongyuan Ge

Abstract

Skin lesion recognition using deep learning has made remarkable progress, and there is an increasing need for deploying these systems in real-world scenarios. However, recent research has revealed that deep neural networks for skin lesion recognition may overly depend on disease-irrelevant image artifacts (e.g., dark corners, dense hairs), leading to poor generalization in unseen environments. To address this issue, we propose a novel domain generalization method called EPVT, which involves embedding prompts into the vision transformer to collaboratively learn knowledge from diverse domains. Concretely, EPVT leverages a set of domain prompts, each of which acts as a domain expert, to capture domain-specific knowledge, and a shared prompt for general knowledge over the entire dataset. To facilitate knowledge sharing and the interaction of different prompts, we introduce a domain prompt generator that enables low-rank multiplicative updates between domain prompts and the shared prompt. A domain mixup strategy is additionally devised to reduce the co-occurring artifacts in each domain, which allows for more flexible decision margins and mitigates the issue of incorrectly assigned domain labels. Experiments on four out-of-distribution datasets and six different biased ISIC datasets demonstrate the superior generalization ability of EPVT in skin lesion recognition across various environments.
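
As a reading aid, the prompt machinery the abstract describes can be condensed into a short sketch. The PyTorch fragment below is illustrative only, not the authors' implementation (see the code repository linked below); the prompt length, rank, initialization, and all names are assumptions. Only the shared prompt, the per-domain low-rank factors, and the multiplicative update (written P ⊙ (U_k U_v) in the author feedback) are taken from the paper's description.

```python
import torch
import torch.nn as nn

class PromptBank(nn.Module):
    """Hypothetical sketch of EPVT's prompt components.

    A shared prompt carries dataset-wide knowledge; each of the M domain
    prompts is derived from it via a low-rank multiplicative update, so
    domains share parameters while keeping domain-specific behaviour.
    """

    def __init__(self, num_domains=5, prompt_len=4, dim=768, rank=4):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        # Per-domain low-rank factors: (prompt_len x rank) and (rank x dim).
        self.u = nn.Parameter(torch.randn(num_domains, prompt_len, rank) * 0.02)
        self.v = nn.Parameter(torch.randn(num_domains, rank, dim) * 0.02)

    def domain_prompt(self, k):
        # P^k = P (element-wise multiplied by) a low-rank, domain-specific
        # matrix u_k @ v_k -- the multiplicative update from the paper.
        return self.shared * (self.u[k] @ self.v[k])

prompts = PromptBank()
p3 = prompts.domain_prompt(3)  # domain-specific prompt, shape (4, 768)
# p3 would be prepended to the class/patch tokens of the ViT input sequence.
```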

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_24

SharedIt: https://rdcu.be/dnwLz

Link to the code repository

https://github.com/SiyuanYan1/EPVT

Link to the dataset(s)

https://github.com/alceubissoto/artifact-generalization-skin

https://github.com/jeremykawahara/derm7pt

https://www.fc.up.pt/addi/ph2%20database.html

https://data.mendeley.com/datasets/zr7vgbcyr2/1


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a domain adaptation framework to alleviate artifact-based biasing of skin cancer diagnosis models. The model detects artifacts in the input image, generates a domain prompt from them, and then integrates this information into the transformer-based classification pipeline to help the model specialize to different artifacts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Representing artifacts as prompts that can be automatically generated from the image is a novel idea in dermatology.

    The architecture, overall, is novel for dermatology.

    Using an adapter to learn the correlation between the domain prompts and using this information to weight the prompts is a novel idea.

    The experimental setup is good. The authors provide ablation, trap set, and prompt weight analysis to support their findings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In several places, it is hard to follow the ideas and how they are implemented. It is hard to tell whether the prompts are given or learned; if they are learned, it is not clear how this is done.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Except that details about the prompt generator are missing, the reviewer thinks the described algorithm is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Are the prompts learned, or are they given by the user? In prompt tuning they are provided by the user, but in this study it is not clear whether they are learned or given. In Section 2.1 the authors say the prompts are defined, but in the following parts of the section they say the prompts are learned. This section is confusing. For a similar reason, the cross-domain knowledge learning part is also confusing. The relevant sections can be improved with more details.

    Section 2.2 is not clear. Where do P*, u_k, and v_k come from, and how does the model learn them? Please make this clearer.

    The inference-time model should be discussed/described explicitly. The reviewer has the impression that the prompts are not required during inference, but this is never explicitly stated.

    In the Ablation Studies section, “… but it performs worse than ERM on the PAD dataset …”: this should say PAD rather than PH2.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented ideas are novel in dermatology. The experiments are well designed and conducted. The presentation requires some improvement to clarify certain concepts and to aid reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a transformer-based domain adaptation method to produce consistent classification results for skin lesion images acquired from different/unseen domains. The overall architecture follows a standard transformer-based domain adaptation framework. A domain prompt generator was developed to learn domain-specific weights that guide the learning process. The domain mixup module was designed to handle images that may come from multiple domains. The experiments on multiple large datasets seem to be comprehensive.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces a novel approach to tackle an important problem of domain shift in skin lesion images.

    The manuscript is well organized, which makes it easy to follow.

    The experimental results, including the number of experimental samples and the comparison methods, are comprehensive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The approach of addressing the domain shift problem by separating the training images into different domains seems problematic. The underlying hypothesis is that irrelevant image artifacts are the major cause of poor performance on unseen images. However, other factors, such as illumination and scanning protocols, should affect the performance much more.
    2. In the experiments section, only the derm7pt_c dataset can be considered a separate domain; all the other images seem to be dermoscopy images and should be considered the same domain as the training images. As can be seen in Table 1, only the derm7pt_c dataset shows a much lower performance.
    3. The training dataset was separated into five groups. There is also a mixup module intended to handle images acquired from multiple groups. It is not clear how this was achieved. With the mixup module, should the network also work without the grouping process?
    4. There are also many unclear points, noted in the following sections.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments were conducted on public datasets. The supplementary material includes source code. Therefore, the paper should have high reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Many terms are defined without proper explanation, e.g., how do the prompt and the low intrinsic rank work?
    2. It is not clear why the baseline was set to the ERM algorithm. Should it be the performance without domain adaptation?
    3. How were the bias, domain distance, and domain weights calculated?
    4. In Table 1, the proposed method achieves relatively good performance compared to current methods. However, it is hard to understand how the proposed method manages to achieve such performance.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper attempts to address an important problem in skin lesion image analysis. However, the method design seems unclear, and the experimental results seem inconclusive.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The proposed domain generalization method based on environment prompts is novel. I like the design of the domain prompt generator based on low-rank multiplicative updates for cross-domain knowledge learning. The method has been evaluated on multiple different skin lesion datasets, with improved performance compared to ERM and other baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using the environment as a prompt for domain generalization is interesting and highly novel.
    • I like the debiasing evaluation together with the correlation study of domain distance and the prompt weight analysis in Fig. 3. The investigation looks very interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Clarity: In Eq. 1, what does the classification token look like? What is its dimension, and how is it obtained? Please clarify.
    • Clarity: A clear overview of the inference procedure is needed given the complexity of the whole framework, but it is missing.
    • Evaluation: The authors claim that they address the co-artifact problem with domain mixup and domain prompt learning. However, there is no direct evidence or evaluation to support that. For example, when the test input has a mix of artifacts, how does the proposed method behave differently? What is the output of the adapter? The correlation study between domain distance and domain weights is not sufficient.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have demonstrated strong reproducibility by providing their code in the supplementary material and making their data readily available to the public. This shows a high level of transparency and accessibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Choice of methodology: The authors introduce a domain adapter to predict the linear correlation between the domain prompts and the target image. They also employ mixup for regularization. However, it seems that the two mechanisms currently work independently, by simply adding the two losses together. The reviewer is curious whether using mixed-up images as input to train the adapter to learn a linear correlation between different domains could result in better performance.
    2. It would be better to add the mathematical symbols (e.g., A, F) of the different modules in Fig. 1 for ease of understanding and better visualization.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good novelty but some parts of the methodology are unclear.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I decided to keep my original decision unchanged. There are some limitations in the methodology design for addressing co-artifact images. Also, it may not be easy to extend to other applications due to the lack of a clear separation of datasets. Nonetheless, I found the paper quite interesting, and it has some novelty.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers mentioned that the details of the proposed method and its implementation should be further explained to make the paper easier to follow. The experimental setting for domain separation may be problematic. Most reviewers are confused about the mixup operation.




Author Feedback

We appreciate that all reviewers found our method novel, our experiments comprehensive, and the work interesting. R2's main concerns are the domain separation setting (Q1) and the OOD experiment setup (Q2), while R1 and R3 both mention that these are the novel parts of our paper. We will address all main issues and add the missing details to the method section.

R2: We would like to kindly point out to R2 that our work is not about domain adaptation (DA); it is about domain generalization (DG). This may have heavily affected the understanding of our work.

Q1 Artifact-based domain separation is problematic: (1) In DG, we cannot know which factors “most significantly” affect the target domain, because the target domain and the domain shift are unknown during training. This is a fundamental difference from DA. Thus, R2's assumption that illumination matters more than artifacts is not valid. Also, using artifacts for domain separation is reasonable, as numerous papers [3-5, 29] in dermatology have shown that artifacts heavily impact the generalization of models. (2) “Visual Representation Learning over Latent Domains” (ICLR) empirically showed that there is no formal definition of an optimal domain separation in DG, and that the domain separations in many well-known DG datasets are sub-optimal. In our dermatology work, we introduce the artifact-based separation and demonstrate its effectiveness by benchmarking 11 DG methods in Table 1: all models outperform the baseline that does not use this domain separation.

Q2 Only clinical data is OOD data: There is no requirement that OOD test data come from a different modality than the training data. OOD refers to a large domain distance between the training and test sets. The domain distances are 43.6 for the in-domain test set and 216.4, 122.82, 187.85, and 246.75 for our four OOD datasets; clearly, the domain distances of all OOD datasets are substantially larger than that of the in-domain test set. Our OOD setup is also common: papers such as “Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization” (NeurIPS) and “Generalizable feature learning in the presence of data bias…” (MICCAI) also use dermoscopic datasets as both training and OOD test sets for DG experiments, further showing the validity of our setup.

Q3 Mixup (R2, R3): Every image in a batch is a combination of two images from two randomly selected domains (a minimal sketch appears after this response). Its effectiveness is shown in Table 2; it uses inter-domain information to mitigate most co-artifact problems, but may not address all the corner cases mentioned by R3.

Q4 Why is the baseline ERM? ERM (empirical risk minimization) is a well-known baseline in DG. In our case, it is a model optimized with the cross-entropy loss.

Q5 How are bias, domain distance, and weights calculated? Distance is calculated as the Fréchet distance between features extracted from the source and target domain data (a sketch appears after this response). Bias is the probability of selecting an image with the artifacts defined in [3]. The weights are the unnormalized w_m in Eq. 4.

Q6 Why does it achieve such performance? We explain why each component is necessary in the first few sentences of each of Sections 2.1-2.4 and in the introduction. Table 2 provides an ablation study demonstrating the importance and effectiveness of each component.

Q7 How do the prompts and the generator work (R1, R2)? We define M learnable prompt vectors, where each domain prompt is optimized solely on data from the corresponding domain. Our model includes the domain-specific prompts P^m (m = 1-5) and a shared prompt P. P^m is obtained as P ⊙ (U_k U_v), where U_k and U_v span the low-rank space of the k-th domain prompt, used for low-rank optimization. This design encourages cross-domain knowledge learning. Please refer to Fig. 2a, Fig. 2b, and Sec. 2.2; a sketch of this construction appears after the Abstract above.

Q8 Inference (R1, R3): Inference uses a weighted prompt based on the domain importance computed by the adapter (sketched after this response).

Q9 Class token (R3)? The class token is a learnable, randomly initialized 1×768 vector.
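
Editorial note: Q3 describes domain mixup only briefly, so a minimal sketch of the stated behaviour follows — each training image is a convex combination of two images from two randomly selected domains. The Beta(α, α) sampling and the α value are assumptions borrowed from standard mixup, not confirmed by the paper.

```python
import torch

def domain_mixup(x_a, y_a, x_b, y_b, alpha=0.2):
    """Sketch of the domain mixup in Q3 (hyperparameters assumed).

    x_a, y_a come from one randomly selected domain and x_b, y_b from
    another; the result is a convex combination of the two.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x_a + (1.0 - lam) * x_b
    # Labels (and, analogously, domain assignments) are mixed with the same
    # coefficient, which softens hard domain boundaries.
    return x, lam * y_a + (1.0 - lam) * y_b
```

Softening the domain assignment this way is what lets mixup mitigate incorrectly assigned domain labels, as claimed in the abstract.

Q5 states that the domain distance is the Fréchet distance on extracted features. Below is the standard recipe for that quantity (the same one used for FID); the paper's feature extractor and preprocessing are not shown and remain assumptions.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_src, feats_tgt):
    """Fréchet distance between Gaussian fits of two feature sets,
    each of shape (num_samples, feature_dim) -- a standard recipe,
    not necessarily the paper's exact computation."""
    mu1, mu2 = feats_src.mean(0), feats_tgt.mean(0)
    s1 = np.cov(feats_src, rowvar=False)
    s2 = np.cov(feats_tgt, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary parts; drop them
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))
```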
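
Finally, Q8's one-line description of inference — a prompt weighted by adapter-computed domain importance — can be sketched as below. The softmax normalization is an assumption (the rebuttal only says the weights are the unnormalized w_m of Eq. 4), and how the adapter produces the weights is not shown.

```python
import torch

def inference_prompt(domain_prompts, weights):
    """Hypothetical sketch of the inference-time prompt from Q8.

    domain_prompts: (M, L, D) stack of the M learned domain prompts.
    weights: (M,) importances w_m produced by the adapter (Eq. 4).
    """
    w = torch.softmax(weights, dim=0)  # normalization assumed, not confirmed
    # Weighted combination of the domain prompts, shape (L, D).
    return (w[:, None, None] * domain_prompts).sum(0)
```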




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal does not fully address the concerns raised by the reviewer.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Though this paper may have an interesting application to this data, in its current form it is not well written, which made it hard for the reviewers to justify acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper is of good quality, but it does have a few weaknesses. One of its strengths is that the research is easily reproducible because the authors provide their code in the supplementary section and make their data available to the public. The ideas presented in the paper are new and exciting, especially in the field of dermatology. The experiments are well-designed and well-executed. However, the way the information is presented could be improved to make some concepts clearer. The authors have addressed and clarified some major concerns and misunderstandings in their rebuttal, so I think the paper should be accepted.


