
Authors

Sontje Ihler, Felix Kuhnke, Svenja Spindeldreier

Abstract

Computer-aided diagnosis (CAD) has gained increased attention in the general research community over the last years as an example of a typical limited-data application, with experiments on labeled datasets of 100k-200k samples. Although these datasets are still small compared to natural image datasets like ImageNet1k, ImageNet21k and JFT, they are large for annotated medical datasets, where 1k-10k labeled samples are much more common. There is no established baseline for which methods to build on in the low data regime. In this work we bridge this gap by providing an extensive study of medical image classification with limited annotations (5k). We present a study of modern architectures applied to a fixed low data regime of 5000 images on the CheXpert dataset. In conclusion, we find that models pretrained on ImageNet21k achieve a higher AUC and larger models require fewer training steps. All models are quite well calibrated even though we only fine-tuned on 5000 training samples. All ‘modern’ architectures have a higher AUC than ResNet50. Regularization of Big Transfer models with MixUp or Mean Teacher improves calibration; MixUp also improves accuracy. Vision Transformers achieve results comparable or on par with Big Transfer models.
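The MixUp regularization mentioned in the abstract can be sketched as follows. This is a minimal, illustrative implementation of the generic MixUp idea (convex combinations of paired inputs and targets), not the authors' exact training code; the function name, `alpha` default, and array shapes are assumptions.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """MixUp: replace a batch with convex combinations of shuffled pairs.

    x: (N, ...) float array of inputs; y: (N, C) float array of
    (multi-)label targets. alpha is the Beta concentration parameter;
    the value 0.2 here is illustrative, not the paper's setting.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # random pairing of samples
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix
```

Mixing the labels alongside the inputs is what softens the targets, which is consistent with the reported calibration improvement.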

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16431-6_62

SharedIt: https://rdcu.be/cVD7i

Link to the code repository

https://gitlab.uni-hannover.de/sontje.ihler/chexpert5000

Link to the dataset(s)

https://stanfordmlgroup.github.io/competitions/chexpert/


Reviews

Review #1

  • Please describe the contribution of the paper

The paper presents a methodical study of modern architectures applied to a fixed low data regime of 5000 images on the CheXpert dataset. Specifically, it studies the BiT and ViT models through experiments, as well as the established regularization methods Mean Teacher and MixUp. These are compared to the well-known and frequently used ResNet50 architecture. They find that models pretrained on ImageNet21k achieve a higher AUC and larger models require fewer training steps. All models were quite well calibrated and performed well. Regularization of BiT-50x1 with MixUp or Mean Teacher improves calibration and accuracy. Vision Transformers achieve results comparable or on par with BiT-50x1.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is a solid set of experiments on using modern architectures on relatively small datasets and evaluating their performance. The discussion of the findings is thought-provoking and interesting, and it provides fertile ground for future work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No significant weaknesses to comment on.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Nothing significant. I would have liked to see domain specific models in comparison also.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good complete work.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

The general computer vision literature is moving towards larger models pretrained on increasingly large-scale datasets, much larger than ImageNet-1K, which is still the standard approach in medical imaging. Although the role of transfer learning has been repeatedly questioned, recent results support this widespread practice, especially when the size of the target dataset is small and when using architectures with weak inductive priors such as transformers [1]. In this paper, the authors experimentally evaluate the performance of Big Transfer models (ResNet50, BiT-50 and BiT-101) and Vision Transformers (DeiT and ViT) pretrained on ImageNet-1K, ImageNet-21K and JFT when transferring to the medical domain. They experiment on CheXpert downsampled to 5000 images to focus on the small data regime. Results show that BiT outperforms ResNet50 and show the advantage of ViT over DeiT, which is in line with previous results [1].

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper brings a new class of models to the attention of the medical imaging community. Results can be easily applied by other researchers
    • Transfer learning from ImageNet is still the standard practice in the medical domain, especially when dataset size is small, and thus the results are relevant
    • Experiments are comprehensive and provide practical pointers e.g., related to the optimal batch size
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Experiments are reported only on one dataset (CheXPert) and only in the small data regime (5000 training images)
    • The paper is somewhat rushed and some parts need to be clarified
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper relies on publicly available pre-trained models and datasets. New splits are defined and will be provided for the final version.

    The methodology is described in detail, but a few details are not very clear, in particular:

    • The description of the fine-tuning setting is not very clear (see detailed comments to authors)
    • Which pretrained models were used in the experiments and the link to retrieve them.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Major comments:

    • I agree with the authors that the small data regime is the most interesting one when exploring transfer learning from the RGB to the medical domain. However, given the large data imbalance in the CheXpert dataset, the authors should report how the dataset was sampled (if random or stratified), and what is the resulting label distribution.
    • The authors report the results on the standard split, as well as on a new split with larger validation and test sets. This is based on previous work by Mustafa et al., who pointed out that the standard validation set is too small to allow meaningful comparison. At the same time, it seems to me deeply unrealistic that a practitioner would train on 5K images, validate on 17K and test on 25K images. A more realistic comparison would be, for instance, to sample multiple test sets from the 25K sequestered images to include the variability due to the test set in the confidence interval. I understand that fully solving this problem would be another paper in itself, but in my opinion it is still a limitation and should at least be addressed
    • Which pretrained models were used in the experiments? The authors mention the timm library, which does not include BiT models.
    • In Table 1, there are two BiT-50x1 models, one trained on 89944 images and one trained on “all”. I would report the actual number of images and clarify the difference between the two sets.
    • In Table 2, the number of iterations for ResNet50 appears to be much larger than for the BiT-* models. It is not clear why, and it would be interesting to compare how the different models converge
    • The description of the fine-tuning setting is not very clear. For BiT, the paper first states that the BiT-HyperRule was used, but then concludes that training with a batch size of 32 and the same plateau scheduler as ResNet yielded better results. As a consequence, it is not clear which protocol was used exactly for each experiment in Tables 1 and 2.
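    The test-set resampling idea suggested above can be sketched as follows. This is a hypothetical illustration of the reviewer's suggestion, not code from the paper; the function name, `draw_size`, and the metric callable are assumptions.

    ```python
    import numpy as np

    def test_set_resampling(metric_fn, pool_indices, n_draws=20,
                            draw_size=5000, seed=0):
        """Estimate test-set variability for a fixed model.

        Evaluate the same model on multiple subsets drawn without
        replacement from a large sequestered pool (e.g. 25K images),
        and summarize the spread of the resulting metric.

        metric_fn: callable mapping an index array to a scalar metric
        (e.g. AUC on that subset); purely illustrative here.
        """
        rng = np.random.default_rng(seed)
        metrics = []
        for _ in range(n_draws):
            subset = rng.choice(pool_indices, size=draw_size, replace=False)
            metrics.append(metric_fn(subset))
        metrics = np.asarray(metrics)
        # mean gives the point estimate, std the test-set-induced spread
        return metrics.mean(), metrics.std(ddof=1)
    ```

    The standard deviation across draws then captures the variability due to the choice of test set, which a single fixed test split cannot reflect.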

    Minor comments:

    • Confidence intervals are lacking, as experiments for some models were still underway at the time of writing. The results are unlikely to change substantially, so in my opinion this is not a major limitation
    • The paper feels somewhat rushed with several typos. For instance deramatology -> dermatology (page 4), presumambly -> presumably (page 4), it’s small size -> its small size (page 5), smalles datasets -> smallest? Datasets (page 6)
    • Some sentences have grammar issues or lack a subject. For instance, at page 5: “Split to intra: automatically created from report, and valid: 234 manually annotated X-rays”

    [1] Matsoukas, C., Haslum, J. F., Sorkhei, M., Söderberg, M., & Smith, K. (2022). What Makes Transfer Learning Work For Medical Images: Feature Reuse & Other Factors. arXiv preprint arXiv:2203.01825.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This overall appears to be a good paper with rigorous methodology and clearly presented results. The results are of interest to the medical image analysis community at large. The paper would have been stronger if additional datasets were included in the analysis.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors conduct an extensive set of experiments to demonstrate the feasibility of large-scale transfer learning on chest x-ray classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The experiments are thorough: they are well designed and thoroughly executed, and they support the paper's claims and contribution well.

    The topic and experiments are relevant - the network architectures (BiT, ViT) that the authors experiment with are likely of interest to the readers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of novelty - there is no major novelty, either technical or in terms of data contribution.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper may be limited due to its large compute and data requirements. The authors also do not publish the code or data used.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    ViTs benefit from large pre-training datasets, yet the authors conducted experiments on the same dataset. It would be good to obtain more data - perhaps not exactly chest X-rays but other related medical images - to see the benefit of ViT pre-training.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well organized and well written. The experiments are well constructed and thoroughly conducted. The methods are relevant and up-to-date.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work is a study of new deep learning architectures on the CheXpert dataset. There is a clear lack of innovation here, and the dataset is also not original. Nevertheless, the reviewers think the experimental work is solid enough to justify acceptance at MICCAI.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

N/A
