
Authors

Philip Müller, Georgios Kaissis, Congyu Zou, Daniel Rueckert

Abstract

Self-supervised pre-training on unlabeled images has shown promising results in the medical domain. Recently, methods using text-supervision from companion text like radiological reports improved upon these results even further. However, most works in the medical domain focus on image classification downstream tasks and do not study more localized tasks like semantic segmentation or object detection. We therefore propose a novel evaluation framework consisting of 18 localized tasks, including semantic segmentation and object detection, on five public chest radiography datasets. Using our proposed evaluation framework, we study the effectiveness of existing text-supervised methods and compare them with image-only self-supervised methods and transfer from classification in more than 1200 evaluation runs. Our experiments show that text-supervised methods outperform all other methods on 13 out of 18 tasks, making them the preferred method. In conclusion, image-only contrastive methods provide a strong baseline if no reports are available, while transfer from classification, even in-domain, does not perform well in pre-training for localized tasks.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_62

SharedIt: https://rdcu.be/cVRzg

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a downstream evaluation framework with 18 localized tasks on chest X-rays, including object detection and semantic segmentation on five public datasets. The authors conduct a comparative study of pre-training methods, including text-supervised and image-only contrastive methods. The authors pre-train their models on MIMIC-CXR and evaluate the studied methods on their localized chest X-ray evaluation framework.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors demonstrate that text-supervised methods outperform all other methods on 13 out of 18 tasks and are less sensitive to the downstream dataset size on some tasks. The authors show that transfer from classification does not perform well and common supervised classification methods seem to be unable to utilize image labels effectively for localized downstream tasks. The authors provide a good justification for the results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The methodology is not novel in general, but the authors provide a comprehensive evaluation framework for 18 localized tasks.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors employ well-known datasets and architectures. The paper seems reproducible, though the authors do not provide code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors study the effectiveness of existing text-supervised methods and compare them with image-only self-supervised methods. The authors did a good job of comparing the text-supervised methods to contrastive methods and to in-domain and cross-domain transfer from classification. The results and justification look reasonable and might be useful to the community. It would be interesting to see results for a deeper backbone for the UNet and a higher input image resolution.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical novelty, reproducibility, and results achieved.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose an evaluation framework consisting of 18 localized tasks, including semantic segmentation and object detection, on five public chest radiography datasets. They test many different SOTA self- and text-supervised methods on many downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I appreciate the large number of experiments in this paper. The authors show the effectiveness of existing text-supervised methods and compare them with image-only self-supervised methods and transfer from classification. The experimental results show that text-supervised methods outperform all other methods on 13 out of 18 tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I think this paper reads more like a technical report than a scientific paper. The authors spend a lot of effort conducting a large number of experiments to show the effectiveness of existing self- and text-supervised methods. However, I do not see any new methods in this paper. It is acceptable to analyze the effectiveness of existing methods in different settings, but the authors should give a detailed analysis of why the text-supervised methods are better and how the radiological reports help to improve performance. The conclusion of this paper does not focus on medical images, and the authors should analyze the differences in the performance of these self- and text-supervised methods between natural and medical images.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All methods and datasets used in this paper are public. I think the results can be reproduced, since the existing methods have open-source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please add a detailed analysis of why the text-supervised methods are better and how the radiological reports help to improve performance. For instance, show some cases with the original images and radiological reports, analyzing which parts of the reports help; or use a Grad-CAM method to show the attention regions associated with the radiological reports. In addition, please analyze the differences in the performance of these self- and text-supervised methods between natural and medical images.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty is limited. No new method is proposed in this paper. The authors only analyze the effectiveness of existing self- and text-supervised methods on different tasks, and do not analyze why the text-supervised methods are better or how the radiological reports help to improve performance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I still think the novelty is limited. I appreciate the large number of experiments in this paper, but it is more important to provide a thorough analysis of how the guidance from radiological reports benefits network training.



Review #4

  • Please describe the contribution of the paper

    This paper studies the performance of different self-supervised pre-training methods on localized imaging tasks on chest X-rays. Two types of pre-training methods are studied: contrastive visual representation learning and text-supervised learning. Three evaluation protocols are employed: fine-tuning, frozen backbone, and linear evaluation. Extensive experiments on five datasets show the advantages of text-supervised learning over contrastive learning methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The comparison of the performance of contrastive and text-supervised pre-training on localized imaging tasks such as semantic segmentation and object detection is interesting. It is inspiring to see that text-supervision is even better than contrastive supervision on most of the tasks.
    • The experiments are extensive and include several datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • While this paper provides a detailed comparison between different self-supervision methods, no principled analysis is provided for choosing the best pre-training method for a given task. For example, given pre-training dataset A and downstream task B, how should one choose the self-supervision method based on the characteristics of A and B?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The hyperparameters of the experiments are described in detail, which should be sufficient to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Could you explain more about the sensitivity of pre-training methods to the size of the downstream dataset shown in Fig. 1? If I understand this figure correctly, the results shown are the performance with 1% or 10% of the downstream data relative to the full data. However, it is not clear why this metric is important. Providing an example could better explain the importance of this sensitivity.
    • It would be better to show the results of combined contrastive and text-supervised learning. Does this combination improve the performance? If so, does it always outperform each individual method on the studied datasets?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is a comprehensive and extensive study of contrastive supervision and text-supervision for pre-training on unlabeled datasets. By providing detailed comparisons, this paper has the potential to inspire more principled studies on self-supervised learning for localized medical image tasks.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Your paper received better scores from Reviewers 1 and 4 than from Reviewer 2, but they are all quite critical of your paper regarding novelty, systematic analysis and insights, experiments, and writing. Intuitively, a performance gain is expected with additional supervision from radiological reports, but as the reviewers asked, a thorough analysis backed up with evidence would be essential. Your experiments need to be improved as the reviewers pointed out. Furthermore, your baselines might not be the SoTA: for example, in the pneumothorax segmentation task, you reported 34.2 (Dice %) for the model pre-trained on CheXpert and obtained 44.0 with radiological reports, but the SoTA should be above 69.00; also, you used NIH ChestX-Ray8 rather than NIH ChestX-Ray14, hampering an apples-to-apples comparison with the SoTA. You may want to review the literature and list the SoTA performance for each of your target tasks. I feel that it may be challenging to overcome these criticisms via rebuttal, but I still want to give you the opportunity to do so.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

The aim of this paper is to provide a comparison of pre-training methods on a standardized evaluation framework for localized medical tasks. As R1, R2, and R4 noted, we do not propose a new method but instead provide extensive quantitative insights into how well different pre-training methods perform on different types of downstream tasks. We support these results via multiple experimental runs, which enables quantification of the variance. We point out that such studies are an important part of scientific progress and provide valuable insights for researchers who either want to use pre-training in applied models or want to further develop pre-training methods. We think that our work is very valuable for practitioners in the medical domain as i) localized tasks are very important and ii) there are no standardized studies on self- and text-supervised pre-training for localized tasks. R2 also asked for an analysis of why text-supervision is superior. Compared to labeled training, the reasoning is not that text-supervision is better but that text is often available as annotation in medical practice, while (localized) labels are expensive to acquire. Compared to unsupervised pre-training, text provides additional supervision, which we showed to be beneficial (e.g., comparing SimCLR with ConVIRT). We like the idea of R2 to study the relevance of different report parts and to show Grad-CAM figures. However, here we focused on the extensive quantitative analysis (which we and some reviewers consider the main strength of our work) and decided to leave qualitative studies to future work, especially considering the large number of datasets and the limited space of the paper. We thank R2 for proposing to focus more on the medical images in the conclusion and will add this in the camera-ready version.

We agree with R4 that an analysis for choosing the best pre-training method for a given task is valuable. We provided an analysis on p. 7 and gave some general hints on preferred methods on pp. 6-7. We also showed the properties of the different studied datasets on p. 4 so that readers can focus on the results that are most relevant for their work. R4 asked for the reasoning behind Fig. 1. We included it to visualize the sensitivity to the size of the downstream dataset. In practice, one would not reduce the size of the dataset; however, in some cases only smaller datasets are available, and we therefore studied how the methods perform in such cases. We like the idea of R4 to combine self-supervised with text-supervised learning and consider it interesting future work, but also out of scope for this paper.

We would like to point out that the meta-reviewer (M1) compared our results with results typically reported on the used datasets and might thus have taken our results out of context. Works that achieve SOTA results on the downstream datasets typically focus on that single task and optimize their hyperparameters for that single task. This makes a comparison of different pre-training methods on different tasks very hard. In order to allow for a standardized comparison, we therefore restricted downstream tuning and architectures (see the discussion section). We especially restricted the input image sizes to a very small size to reduce computational resources (and therefore the environmental impact). This restrictive setup reduces the risk of bias towards a particular pre-training method and allows for a computationally efficient comparison of methods. We argue that pre-training only changes the initialization of downstream models, and therefore our findings are also valid for models optimized for a single task.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The performance reported in the paper is significantly below the state of the art, and the argument for this poor performance in the rebuttal is not scientifically convincing, as there is no reason to restrict how the pretrained models are fine-tuned on downstream tasks. If a method significantly underperforms the state of the art, it would offer little value to our research community. I believe that this paper could become a great paper if the authors can improve the experimental setup and demonstrate state-of-the-art performance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the reviewers that this paper does not have technical novelty. I also agree with the two reviewers who found the extensive set of experiments to be of value. To be more specific, I believe that the comparisons in this paper will guide future work on using the text of radiological reports as a form of supervision, and in particular local supervision.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has been received very well, and one of the reviewers has stated that they want to increase their rating from 5 to 6.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


