
Authors

Kihyun You, Jawook Gu, Jiyeon Ham, Beomhee Park, Jiho Kim, Eun K. Hong, Woonhyuk Baek, Byungseok Roh

Abstract

A large-scale image-text pair dataset has greatly contributed to the development of vision-language pre-training (VLP) models, which enable zero-shot or few-shot classification without costly annotation. However, in the medical domain, the scarcity of data remains a significant challenge for developing a powerful VLP model. In this paper, we tackle the lack of image-text data in chest X-ray by expanding image-label pairs into image-text pairs via general prompts and by utilizing multiple images and multiple sections of a radiologic report. We also design two contrastive losses, named ICL and TCL, for learning study-level characteristics of medical images and reports, respectively. Our model outperforms state-of-the-art models trained under the same conditions. In addition, the enlarged dataset improves the discriminative power of our pre-trained model for classification, at the cost of a marginal drop in retrieval performance. Code is available at https://github.com/kakaobrain/cxr-clip
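As a rough illustration of the label-to-text expansion described above, here is a minimal sketch of prompt-based conversion of an image-label pair into an image-text pair. The template wording, class names, and function name are hypothetical, not taken from the paper.

    import random

    # Hypothetical prompt templates for turning class labels into report-like sentences.
    POSITIVE_TEMPLATES = ["there is {label}.", "findings suggesting {label}."]
    NEGATIVE_TEMPLATES = ["no evidence of {label}.", "no {label}."]

    def labels_to_text(labels):
        """Convert an image-label pair into an image-text pair by sampling a prompt per label."""
        sentences = []
        for name, present in labels.items():
            templates = POSITIVE_TEMPLATES if present else NEGATIVE_TEMPLATES
            sentences.append(random.choice(templates).format(label=name))
        return " ".join(sentences)

    # Example: a CheXpert-style label dictionary becomes a synthetic, report-like text.
    print(labels_to_text({"cardiomegaly": True, "pleural effusion": False}))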

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_10

SharedIt: https://rdcu.be/dnwxS

Link to the code repository

https://github.com/kakaobrain/cxr-clip

Link to the dataset(s)

https://physionet.org/content/mimic-cxr/2.0.0/

https://stanfordmlgroup.github.io/competitions/chexpert/

https://nihcc.app.box.com/v/ChestXray-NIHCC

https://www.physionet.org/content/vindr-cxr/1.0.0/

https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

https://www.kaggle.com/c/siim-acr-pneumothorax-segmentation

https://openi.nlm.nih.gov/faq


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to enlarge the pre-training data for medical images and text. To this end, it proposes to use image-label datasets as image-text pairs via prompts and to utilize multiple images and report sections. The paper further presents two contrastive losses, i.e., image contrastive learning (ICL) and text contrastive learning (TCL). Experiments conducted on several datasets demonstrate the effectiveness of the proposed approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The motivation of using image-label data to enlarge medical image-text pre-training is sound.
    2. The proposed approach outperforms previous methods on multiple datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The presentation should be improved. The article appears to have been written in a hurry, and there is considerable room for improving its presentation.
      • There are many abbreviations in the main text. I recommend the authors abbreviate only the important terms.
      • For example, it would be better to give a detailed description (or the full names) of TCL and ICL in the Abstract.
    2. The proposed TCL and ICL are not novel.
      • Image-only contrastive learning has been widely explored in the literature.
      • Text-only contrastive learning has been widely explored in NLP and in several medical image-text pre-training papers, e.g., [1].
    3. The experiments should be improved.
      • Several newly published works [2][3] are missing. I strongly recommend the authors compare the proposed approach with these existing works.
      • I would like to see a statistical significance test, because the performance gap between the proposed approach and the previous state-of-the-art methods is small. Moreover, such a test can mitigate the random impact of the few-shot settings.
      • The analysis is weak: it does not provide insight into the contribution of each component, how each component affects the final results, or why it addresses the claimed problems.
      • The paper is written in an optimistic tone that leads the reader to assume the proposed approach is rather good. However, I am more interested in knowing whether the approach introduces errors, what types of errors, and why.

    References: [1] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV, 2022. [2] Advancing Radiograph Representation Learning with Masked Record Modeling. ICLR, 2023. [3] Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    (Same points as listed under the main weaknesses above.)

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    (Same points as listed under the main weaknesses above.)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the over-complicated design, in which many steps lack proper motivation, as well as the overall presentation quality, this paper still reads as unfinished work that requires major revision.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper adopts a CLIP-based method to address medical imaging tasks. The authors follow the Multi-View Supervision (MVS) structure from DeCLIP, using two images and two texts to form a pair of training data. They innovate in the construction of data pairs, which can be built from various datasets with different data compositions. For the loss function, the authors improve the MVS loss and propose an ICL loss for image pairs and a TCL loss for text pairs. Two types of experiments were conducted: “zero-shot and few-shot classification” and “image-to-text retrieval”. Under the same conditions, the model outperforms the SOTA. The authors also point out that combining multiple datasets improves classification results, but the text constructed from prompts may harm text retrieval.
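    A minimal sketch of how such a three-term objective could be combined is given below. The function names, weights, and the use of plain InfoNCE terms are assumptions for illustration, not the authors' exact implementation.

        import torch
        import torch.nn.functional as F

        def info_nce(a, b, temperature=0.07):
            """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
            logits = a @ b.t() / temperature
            targets = torch.arange(a.size(0), device=a.device)
            return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

        def total_loss(img1, img2, txt1, txt2, w_icl=1.0, w_tcl=1.0):
            """CLIP-style image-text loss plus image-image (ICL) and text-text (TCL) terms."""
            l_clip = (info_nce(img1, txt1) + info_nce(img2, txt2)) / 2  # image-text matching
            l_icl = info_nce(img1, img2)   # two images from the same study as positives
            l_tcl = info_nce(txt1, txt2)   # two text views (e.g., two report sections) as positives
            return l_clip + w_icl * l_icl + w_tcl * l_tcl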

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The authors follow the MVS structure, using two images and two texts to form a pair of training data, and improve the data composition of MVS. (2) The proposed data augmentation strategy can combine multiple datasets and integrate different data components (converting image-label pairs into image-text pairs using prompts). (3) The loss function design is improved: similarities between image-text, image-image, and text-text pairs are all computed. (4) The experimental design is complete, including the above two types of experiments and ablation experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The overall architecture is similar to MVS, which limits the novelty. (2) The innovation in the loss function design is modest. (3) In the comparative experiments, the dataset used by the SOTA method GLoRIA is not available, so the authors ran experiments on M, C, and C14. Will this affect the persuasiveness of the experiments?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    (1) No code is provided. (2) The supplementary materials provide additional experimental results, text templates for the zero-shot classification tasks, and prompt templates for text data augmentation. (3) The model design in the paper is clear; the model should be reproducible based on the descriptions in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please address my concerns in the weakness part.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Based on my comments above, my preliminary recommendation is weak accept.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes to enlarge the vision-language data for chest X-ray from image-label data, in order to pre-train a vision-language model for chest X-ray.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of creating text from labels using templates is simple but proven to be effective for classification as a downstream task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the proposed loss function is limited, as it is a combination of previous works [1,2,3] with some modifications.

    [1] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV, 2022. [2] A Data Efficient Contrastive Language-Image Pre-training Paradigm. ICLR, 2022. [3] MedAug: Contrastive Learning Leveraging Patient Metadata Improves Representations for Chest X-ray Interpretation. PMLR, 2021.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It seems possible to reproduce the work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be interesting to see the results of different ways of creating text from labels, such as using a language model, but that may be out of the scope of this work.

    While the proposed method does not improve the image-to-text retrieval task, it would be interesting to see the effect on the text-to-image retrieval task as well.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The simplicity of the idea and the results on the classification task.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors propose CXR-CLIP, a method for vision-language pre-training on chest x-ray images. The authors propose a combination of image-text and image-label pairs, and introduce contrastive loss terms to improve model performance. They evaluate their method on various datasets, perform an ablation study, and compare themselves against some other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The introduction, contribution, and related work sections are clear.
    • The workflow is nicely presented in Figure 1.
    • The experiments consider quite a lot of datasets.
    • A good ablation study was performed.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • In the loss function L_CLIP in Section 3.3, I think there is a mistake. Why are there commas between u_i^T and v_j?
    • It would be nice to see an example of an image-text pair and an image-label pair.
    • In Table 3, the performance seems to drop when training on more data (M, C, C14). Why is that? Isn’t this counter-intuitive? Please address this in the rebuttal.
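    For reference, a generic form of the symmetric CLIP objective, written in LaTeX with dot products u_i^T v_j and a temperature τ (a standard reconstruction for comparison, not necessarily the paper's exact formulation), is:

        \mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[
            \log\frac{\exp(u_i^{\top} v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^{\top} v_j/\tau)}
          + \log\frac{\exp(v_i^{\top} u_i/\tau)}{\sum_{j=1}^{N}\exp(v_i^{\top} u_j/\tau)}
        \right]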

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The used datasets are available. However, it is unclear whether the code will be. The computational resources are indicated, and the used method well described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please address all points listed under “weaknesses”. Moreover, please check your grammar.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is nicely explained and thoroughly evaluated. Although the loss formulations are not novel, they are well applied to the problem of vision-language pre-training.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel approach to enlarge the pre-training of medical images and text using image-label datasets as image-text pairs with prompts. The authors propose two contrastive learning losses, image contrastive learning (ICL) and text contrastive learning (TCL), and demonstrate the effectiveness of their approach through experiments on multiple datasets. The key strengths of the paper include the motivation for using image-label data, the proposed approach’s performance, novel data composition, data augmentation strategy, and improved loss function design. The experimental design is comprehensive, including zero-shot and few-shot classification as well as image-to-text retrieval tasks. On the other hand, weaknesses were identified in terms of presentation, comparison with recent works, lack of novelty in contrastive learning techniques, experimental setup and statistical significance tests, and potential limitations in novelty and dataset availability. Considering these limitations, the rebuttal should address how the proposed approach differs or improves upon prior works on image-only and text-only contrastive learning, the lack of comparisons with recent self-supervised works and the statistical significance of the experimental results.




Author Feedback

We greatly appreciate the reviewers for their detailed feedback and constructive critiques. We present our responses sorted by topic.

Q1. (R1, R2, R3) Novelty of contrastive learning: Our primary contribution lies in expanding the training image-text pairs by utilizing image-label pairs. Furthermore, the use of multiple images and texts within the same study notably enhances the model's ability to discriminate the features inherent in chest X-ray images and reports. Compared to DeCLIP [1], the composition of the training data differs considerably. The use of study information, diverse prompts, and an improved augmentation procedure collectively contributes to model performance, as shown in Table 4. While the ICL and TCL designs resemble previous works [2, 3], the training procedures differ: [2, 3] are limited to image-level or text-level self-supervised learning, whereas we demonstrate that the two additional supervisions work well in an end-to-end manner together with vision-language matching.

Q2. (R1) Comparison with recent works [4, 5]: Using a ViT-B pre-trained on MIMIC for a fair comparison, we conducted two experimental settings: 1) linear probing and 2) fine-tuning the whole network. Our model performs best in linear probing and achieves results competitive with MRM [5] in fine-tuning. 1) Linear probing: Ours (V: 89.3, R: 89.6, S: 90.2), REFERS (V: 83.6, R: 86.7, S: 81.3), MRM (V: 77.0, R: 86.7, S: 86.0). 2) Fine-tuning: Ours (V: 91.6, R: 90.3, S: 92.7), REFERS (V: 90.1, R: 87.9, S: 89.5), MRM (V: 91.3, R: 89.9, S: 93.3). V: VinDr-CXR, R: RSNA-Pneumonia, S: SIIM-Pneumothorax. Note: the data split of R differs from that of MRM, resulting in different performance for [5].

Q3. (R1) Statistical significance of the results: We analyzed 4 pre-trained models with different seeds. In linear probing, the performance variances are 0.31 (V), 0.15 (R), and 0.33 (S). These results suggest statistically significant performance gains compared to MedCLIP: 3.9 (V), 0.8 (R), and 3.1 (S).

Q4. (R2) Comparison with GLoRIA: GLoRIA uses the image-text pairs (C) in CheXpert, which were published as the image-label pairs (C) mentioned in Table 2. Our model outperforms GLoRIA in both classification and retrieval tasks (except C5x200) with a similar amount of pre-training image-text pairs: GLoRIA: C* (21k), ours: M (22k).

Q5. (R1) Analysis of each component: Our method comprises three parts: 1) converting image-label datasets into image-text datasets via prompts, 2) batch composition, and 3) additional loss functions. The effects of 1) are analyzed for classification (Table 2) and retrieval (Table 3): adding more image-label data improves classification but degrades retrieval performance. The effects of 2) and 3) are analyzed in the ablations of Section 4.4 (Table 4), where each component of the data composition and each additional loss contributes to the performance gains.

Q6. (R1, R4) Side effect of leveraging image-label data in Table 3: Adding more image-label data tends to degrade retrieval performance because the contribution of the text in the original reports is diluted (Section 4.3).

Q7. (R3) Text-to-image retrieval: We evaluated image-to-text retrieval because of duplicate texts across multiple images. We deduplicated the texts in the test set for image-to-text retrieval, but deduplicating images for text-to-image retrieval is not trivial.
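As a rough illustration of the per-seed significance check described in Q3, the following minimal sketch computes the seed mean and standard deviation and runs a one-sample t-test against a single baseline score. The numbers and the use of scipy are placeholders for illustration, not the actual reported values or analysis code.

    import numpy as np
    from scipy import stats

    # Hypothetical AUC scores of one model over 4 pre-training seeds (placeholder values).
    ours = np.array([89.1, 89.4, 89.0, 89.7])
    baseline = 85.4  # single reported score of the compared model (placeholder)

    # One-sample t-test: is the mean of the seed scores different from the baseline score?
    t_stat, p_value = stats.ttest_1samp(ours, popmean=baseline)
    print(f"mean={ours.mean():.2f}, std={ours.std(ddof=1):.2f}, t={t_stat:.2f}, p={p_value:.4f}")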

We thank R4 for pointing out the mistake in the loss function. All comments and concerns will be carefully addressed in the revision.

[1] A data efficient contrastive language-image pre-training paradigm. ICLR, 2022. [2] Medaug: Contrastive learning leveraging patient metadata improves representations for chest x-ray interpretation. PMLR, 2021. [3] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV, 2022. [4] Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 2022. [5] Advancing Radiograph Representation Learning with Masked Record Modeling. ICLR, 2023.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the authors’ rebuttal and the reviewers’ comments, the paper has strengths in terms of motivation, performance, data composition, and loss function design. The authors have provided clarifications and addressed some concerns. While the authors highlight the expansion of training data and improvements compared to prior works, the exact novelty of the proposed approach is not clearly articulated. It is important for the authors to address this concern and clearly highlight the unique aspects of their approach. It is also recommended to provide a more rigorous analysis of statistical significance to strengthen the experimental results.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the reviewers that the novelty is limited and that the paper is not well written. Nonetheless, the topic is important and the experiments are rather extensive. Therefore, I recommend accept.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors propose a method for vision-language pre-training on chest X-ray images. The paper focuses on an interesting topic, but even after the rebuttal I still have concerns about methodological novelty, as the proposed techniques are standard algorithms. The focus is more on adaptations and engineering innovation to achieve good performance. Yet, there is no clear use-case demonstration or general application angle. Hence, my recommendation is to reject.


