
Authors

Deval Mehta, Yaniv Gal, Adrian Bowling, Paul Bonnington, Zongyuan Ge

Abstract

Recent years have witnessed rapid development of automated methods for skin lesion diagnosis and classification. With the increasing deployment of such systems in clinics, it has become important to make them more robust to various Out-of-Distribution (OOD) samples (unknown skin lesions and conditions). However, current deep learning models trained for skin lesion classification tend to classify these OOD samples incorrectly into one of their learned skin lesion categories. To address this issue, we propose a simple yet strategic approach that improves OOD detection performance while maintaining the multi-class classification accuracy for the known skin lesion categories. Specifically, the approach is built upon a realistic scenario of a long-tailed and fine-grained OOD detection task for skin lesion images. Through this approach, 1) we first target mixup amongst the middle and tail classes to address the long-tail problem, and 2) we then combine this mixup strategy with prototype learning to address the fine-grained nature of the dataset. The contribution of this paper is two-fold, justified by extensive experiments. First, we present a realistic problem setting of the OOD task for skin lesions. Second, we propose an approach that simultaneously targets the long-tailed and fine-grained aspects of the problem setting to improve OOD performance.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16431-6_69

SharedIt: https://rdcu.be/cVD7p

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a mixup-augmentation and prototypical-learning based OOD detection method for skin lesion images. They first group the classes into three groups (head, middle, and tail) according to their occurrence in the dataset. They show that mixing up samples from the middle and tail classes helps learn better representations for the purpose of OOD detection, as sketched below. They integrate their mixup strategy into prototypical learning to enhance the capability of their model.
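    [Editor's note] To make the described sampling strategy concrete, here is a minimal sketch (with assumed names such as `samples_by_class`; an illustration, not the authors' code) of restricting mixup pairs so that one image comes from a middle class and its partner from a tail class:

```python
import random

def sample_mt_pair(samples_by_class, middle_classes, tail_classes):
    """Draw one sample from a middle class and one from a tail class,
    so that mixup only combines middle- and tail-group images."""
    m_cls = random.choice(middle_classes)
    t_cls = random.choice(tail_classes)
    x_m = random.choice(samples_by_class[m_cls])
    x_t = random.choice(samples_by_class[t_cls])
    return (x_m, m_cls), (x_t, t_cls)
```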

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As far as the reviewer knows, mixup between different occurrence groups is a novel approach (at least for skin lesion images). The authors extensively tested various mixup combinations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is not clear whether mixing up the lesion images in the input pixel domain makes sense. The mixed-up images do not look realistic, and it is not clear why the model should learn a better representation from them. For example, the authors mention that they also included some unusual samples, such as blurred images, in the OOD class. Depending on the mixup weights, a mixup lesion can easily look like a blurred image, so it is not clear whether training in this way helps in such situations.

    It is not clear how susceptible the model is to the selection of the three groups (head, middle, tail). How did the authors pick the cutoffs between the groups? Did the authors try different cutoffs to group the classes in a slightly different manner and conduct experiments? The authors did not address any of these issues.

    It is not clear how the authors trained the prototypical network.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The framework used is not mentioned. The imaging modality of the in-house dataset is not mentioned. A sensitivity analysis regarding the hyperparameters is missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    An ablation study on the susceptibility of the model against the selection of grouping is missing.

    It is not clear how the authors trained the prototypical networks. How did they decide on the feature level to be used for extracting embeddings?

    Equation 3 is not clear. Why do the authors put d_{xpi} through the network again? Assuming that “f” represents the model, why and how would the output of f(d_{xpi}) be y_i?

    What type of images is the in-house dataset composed of? Dermoscopy images or clinical images?

    The authors did not compare their model against algorithms from the literature that were designed explicitly for OOD detection on skin images. Some examples are:
    - M. Combalia, F. Hueto, S. Puig, J. Malvehy, and V. Vilaplana. Uncertainty estimation in deep neural networks for dermoscopic image classification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3211–3220, 2020.
    - A. Guha Roy, J. Ren, S. Azizi, A. Loh, V. Natarajan, B. Mustafa, N. Pawlowski, J. Freyberg, Y. Liu, Z. Beaver, et al. Does your dermatology classifier know what it doesn’t know? Detecting the long-tail of unseen conditions. arXiv preprint arXiv:2104.03829, 2021.
    - A. G. C. Pacheco, C. Sastry, T. Trappenberg, S. Oore, and R. Krohling. On out-of-distribution detection algorithms with deep neural skin cancer classifiers.
    - M. Torop et al. Unsupervised approaches for out-of-distribution dermoscopic lesion detection. arXiv e-prints, 2021.
    - S. Bagchi, A. Banerjee, and D. R. Bathula. Learning a meta-ensemble technique for skin lesion classification and novel class detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 746–747, 2020.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is overall well written and clear, except in a few places. Good results.

    The details of the prototypical network training are missing, as are references to and comparisons against the literature.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes an out-of-distribution (OOD) detection method with application to long-tailed skin lesion datasets. The method consists of two parts: i) an augmentation strategy (mixup) targeted at the middle and tail classes in the dataset (classes are categorized according to the number of samples in the dataset); and ii) prototype learning integrated with the augmentation strategy.

    The authors experiment with mixup augmentation targeted at different parts of the dataset and conclude that targeting the middle and tail classes yields the best performance. They conducted experiments on ISIC 2019 and an in-house dataset to evaluate the ability of the proposed method to detect out-of-distribution samples.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is generally well-written and ideas are well-presented.
    2. The authors do a good job of citing previous work, exploring the different OOD methods, and briefly explaining the ideas of the relevant papers.
    3. The authors identify a limitation with the current OOD methods as these methods are not suitable for practical clinical applications where the differences between in-distribution (ID) and OOD samples are visually subtle. They design their study to be able to tackle this limitation and also evaluate on a real-world clinical dataset collected by the authors.
    4. The idea of using targeted mixup augmentation and prototype learning in OOD is novel and seems to give better performance than baselines.
    5. The method is evaluated against several OOD methods and on two different datasets. The proposed method has superior OOD performance on both ISIC and the in-house dataset, which shows the robustness of the proposed method on real-world clinical datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The prototype learning part needs more explanation.
    2. The confidence score section is interesting but needs more explanation. How is it computed?
    3. No statistical significance of the results is reported.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. Regarding #1 in weaknesses, please elaborate more on the prototype learning part. What is the standard prototype loss? How do you know/compute the class-specific prototypes {p_i, p_j}? Why do the authors think it will help detect the fine-grained samples in the dataset?
    2. Regarding #2 in weaknesses, please provide more details on how the confidence score is computed. Is it the predicted softmax probability?
    3. It would be interesting and more convincing to readers to see the statistical significance of the reported results.
    4. I guess the reported results on the in-house dataset are on the test set, but on ISIC 2019 the authors report results on the validation set (cross-validation). That needs to be mentioned in the text, e.g. in Section 3.1.
    5. “It can further be noted that our proposed approach of M-T mixup (MX5) strategy combined with prototype learning performs the best for OOD detection of all different categories while maintaining or slightly improving the overall ID performance compared to the baseline on both the datasets.” I disagree: the proposed strategy gives the best OOD detection, but it does not improve the overall ID performance compared to other methods, as seen from Table 2. For example, ARPL is consistently better than the proposed method on the in-house dataset. This is okay, as the method is primarily an OOD method, but the sentence above needs to be adjusted.
    6. Table 2: please either make all best scores in bold or unbold all numbers. For example, in the pre and rec columns for the in-house dataset, there are no bold values, unlike the rest of the table.
    7. What are the values of the different lambdas used in your experiments, e.g. in Eq. 2, Eq. 3, lambda_mse, and the lambdas in Fig. 2? You might mention only the values used for the best-performing method (MX5) in the main text and move the rest to an appendix. You might also want to use a better notation for clarity, rather than referring to all of them as lambda.

    8. Minor:
       8.1. “… verified dermoscopic images categorized amongst 65 different …”: might be better to say categorized into.
       8.2. “… This simple technique has shown to increase …”: has been shown.
       8.3. “… thus limiting it’s advantages …”: its.
       8.4. “… which are the standard parameters for measuring …”: these are metrics, not parameters.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors provide most of the implementation details except for the values of hyperparameters (lambdas). Providing these additional hyperparameters in the revision would facilitate the reproducibility.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written and organized. The presented idea is interesting, has been adequately evaluated on two different datasets, and performs better than other methods. The idea is motivated by the limitation of other methods when applied to real-life clinical datasets. I think the paper will be interesting to the MICCAI audience after the requested revisions to some parts.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper describes a method combining the Mixup method with prototype learning, aiming to improve out-of-distribution detection in the context of dermatology imaging. The authors propose several mixup strategies and report a comparison to a baseline as well as several state-of-the-art methods, demonstrating the superiority of the proposed method in the context of unknown data of the same nature (i.e., unknown classes). The paper uses both private and public (ISIC 2019) datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall the paper is clear and quite comprehensive. The addressed topic is clinically and methodologically relevant, and the proposed method has the potential to be applied to other types of imaging. There is a reasonable amount of ablation and comparison reported to allow a fair understanding of the method’s interest.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some parts of the paper are worth revising for a clearer and more straightforward presentation. In particular, the variety of mixup strategies might be better presented to facilitate comprehension. Moreover, the results are worth further discussion, as the reported Mixup results appear to outperform the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors give some details on the private dataset used and also describe how they used the ISIC dataset. They provide a reasonable amount of information on the training and implementation, as well as on the metrics used.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
     There are several modifications that could improve the paper.
     Method:
     - Could the authors detail how the lambda is selected, as it does not clearly stand out from the paper?
     Experiments:
     - Could the authors detail why 20 classes were chosen as OOD? Has any ablation been done on this?
     - Could the authors revise the ISIC class labels, as they might not be clear to an unaware reader?
     - Could the authors specify how the precision, recall, and F1 are calculated for multi-class classification on an imbalanced dataset?
     - Could the authors motivate the choice of ResNet34?
     Results:
     - Is it possible to report the classification metrics (prec, rec, F1) for the tail classes?
     Mixup Ablation:
     - Could the authors further clarify/discuss their choice of the M-T strategy for the subsequent experiments?
     Conclusion:
     - The authors state "OOD techniques are still far away from clinical deployment". Could the authors provide further details supporting this conclusion?
     Syntax and writing:
     Introduction:
     - In "There are two shortcomings in this aspect - 1) ... 2)", could the authors revise the syntax of items 1) and 2) to facilitate reading?
     - The main contribution appears to be somewhat buried in the text. Could the authors consider outlining it differently?
    
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper appears to be quite clear and the reading flows well. The OOD issue, studied here on dermatology imaging, is relevant to other imaging areas as well, so it may be interesting for the community.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper addresses out-of-distribution detection for skin images by integrating two recent ML concepts: the mixup method and prototype learning. The three reviewers agree on the clinical pertinence, the novelty of the idea, and the potential impact for the MIC community beyond dermatology applications. The experimental evaluation was also deemed sufficient, with evaluation on two different datasets (including a public one), several mixup schemes, and a comparison against other methods with good performance. Points to address in the rebuttal and the revised version are:
    - Clarify the description of the prototype learning approach and the data modalities.
    - Revise the state of the art to include OOD methods for skin imaging.
    - Reproducibility: discuss how the hyperparameters (lambda) were chosen and provide their values.
    - Further discuss the motivation for pixelwise mixup and the different mixup strategies.
    - Discuss the sensitivity to group selection.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

Thanks to all the reviewers for providing constructive feedback. Below we provide our responses to the major points (minor comments will be integrated into the revised paper).

Reviewer #1: [Mixup between skin lesion images in pixel space] Mixup has proven useful for uncertainty estimation, and specifically for improving OoD detection performance on natural image datasets ([31] and https://proceedings.neurips.cc/paper/2019/file/36ad8b5f42db492827016448975cc22d-Paper.pdf). Naively adopting mixup for skin lesion images already improved the OoD performance. Although a mixup sample may look confusing, it helps the model differentiate better between the two categories in the sample due to the weighted loss. However, we do notice in our experiments that the amount of mixup between two samples affects the OoD performance (supplementary material). We suggest that this amount must be judiciously selected for specific datasets depending on the nature of the images.
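For concreteness, the weighted-loss mixup described above can be sketched as follows (a minimal PyTorch-style illustration with assumed placeholder names `model`, `images`, and `labels`; not our exact implementation):

```python
import torch
import torch.nn.functional as F

def mixup_step(model, images, labels, alpha=0.2):
    # Mixing coefficient sampled from a Beta distribution, as is standard for mixup.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]

    logits = model(mixed)
    # Weighted loss: the sample contributes to both of its source labels in
    # proportion to the amount of mixup, which is what lets the model
    # differentiate the two categories inside a single mixed image.
    loss = lam * F.cross_entropy(logits, labels) \
        + (1.0 - lam) * F.cross_entropy(logits, labels[perm])
    return loss
```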

Reviewer #1 and #2: [Combination of Mixup and Prototype learning – detailed explanation] In standard training, the prototype loss consists of an MSE loss and a distance-based cross-entropy loss between the latent features from an encoder (e.g. ResNet) and the corresponding prototypes of all the categories. This reduces the intra-class variance and increases the inter-class distance in the feature space, which helps improve fine-grained classification. In our approach, we believe this can be further enhanced by combining it with the Mixup technique. However, a naïve combination would not serve OoD detection well, as the overall learning would still be dominated by the head classes. So, we only use this combination of Mixup and Prototype learning for the M-T mixup strategy (the best performing for OoD). For a mixup sample, the prototype loss is changed to a weighted MSE loss and a weighted distance-based cross-entropy loss, where the weights correspond to the amount of mixup applied for each category. In this way, the learning of the network parameters is aligned with the amount of mixup. The intuition behind this idea is that the features extracted by the network for a mixup sample will overall represent the weighted semantics of both lesion categories present in the sample. Thus, the same weighting (as in the mixup sample) has to be applied to the loss on these features so that the feature space becomes even better for OoD detection. Equation (3) has a minor error: the distance d_{xij} should be compared directly with the prototypes, i.e. it is d_{xij} and not f(d_{xij}). Thanks to Reviewer #1 for pointing this out.
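A minimal sketch of this weighted prototype loss (assumed names `encoder` and `prototypes`; the weighting of the MSE term via `lambda_mse` and the exact distance used in Eq. 3 may differ from the paper):

```python
import torch
import torch.nn.functional as F

def mixup_prototype_loss(encoder, prototypes, x_mixed, y_i, y_j, lam,
                         lambda_mse=0.01):
    """Weighted prototype loss for a mixup sample built from classes y_i and
    y_j with mixing coefficient lam; `prototypes` is a learnable
    (num_classes, feat_dim) tensor of class prototypes."""
    z = encoder(x_mixed)                      # latent features, (B, feat_dim)
    dists = torch.cdist(z, prototypes) ** 2   # squared distance to every prototype
    logits = -dists                           # distance-based logits: closer -> more likely
    # Weighted distance-based cross-entropy over both source categories.
    dce = lam * F.cross_entropy(logits, y_i) \
        + (1.0 - lam) * F.cross_entropy(logits, y_j)
    # Weighted MSE pulls the feature toward both class prototypes in
    # proportion to the amount of mixup.
    mse = lam * F.mse_loss(z, prototypes[y_i]) \
        + (1.0 - lam) * F.mse_loss(z, prototypes[y_j])
    return dce + lambda_mse * mse
```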

Reviewer #1 and #3: [Division of the dataset into head, middle, and tail subsets and 20 OOD classes for our in-house dataset] The selection of the grouping is based on the sample distribution amongst the categories in the dataset. For our in-house dataset, we had clear cut-offs: head categories have more than 10,000 samples, middle categories have between 1,000 and 10,000 samples, and tail categories have fewer than 1,000 samples. Usually, with such a dataset distribution, the head categories are the fewest in number (6 in our case), the middle categories are moderate (17 in our case), and the tail categories are the most numerous (42 in our case). For the OOD categories, we wanted to split the whole tail subset almost equally, so we reserved the last 20 classes as OoD for our in-house dataset. For the ISIC 2019 dataset, which is not a long-tail dataset, the most straightforward option was to split it into parts of two categories each (head, middle, tail, and OoD). In our opinion, when splitting a dataset into head, middle, and tail categories, it is important to follow the realistic scenario where most samples are concentrated in the head categories, with a moderate number of samples in the middle and the fewest samples in the tail. An ablation study on this with different medical-domain datasets will be included in an extension of this work.
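A minimal sketch of this cut-off-based grouping (assuming a `class_counts` mapping from category name to number of samples; the handling of counts exactly at the cut-offs is an assumption):

```python
def group_classes(class_counts, head_min=10_000, middle_min=1_000):
    """Split categories into head (>= 10,000 samples), middle
    (1,000 - 10,000 samples), and tail (< 1,000 samples) groups."""
    head, middle, tail = [], [], []
    for cls, n in class_counts.items():
        if n >= head_min:
            head.append(cls)
        elif n >= middle_min:
            middle.append(cls)
        else:
            tail.append(cls)
    return head, middle, tail
```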


