
Authors

Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie

Abstract

Foundation models trained on large-scale datasets have seen a recent surge in CV and NLP. In contrast, development in the biomedical domain lags far behind due to data scarcity. To address this issue, we build and release PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral’s OpenAccess subset, which is 8 times larger than before. PMC-OA covers diverse modalities or diseases, with the majority of the image-caption samples aligned at a finer-grained level, i.e., subfigure and subcaption. When pretraining a CLIP-style model on PMC-OA, our model, named PMC-CLIP, achieves state-of-the-art results on various downstream tasks, including image-text retrieval on ROCO, MedMNIST image classification, and Medical VQA, i.e., +8.1% R@10 on image-text retrieval and +3.9% accuracy on image classification.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43993-3_51

SharedIt: https://rdcu.be/dnwNW

Link to the code repository

https://github.com/WeixiongLin/PMC-CLIP

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The manuscript presents an automatic pipeline to construct a high-quality image-text biomedical dataset from scientific papers, called PMC-OA, which contains 1.6 M image-text pairs covering a wide scope of diagnostic procedures and diseases. The authors also pre-train a vision-language model, PMC-CLIP, on PMC-OA and evaluate its performance on various downstream tasks, including medical image-text retrieval, medical image classification, and medical visual question answering.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The proposed PMC-OA dataset with 1.6 M image-text pairs can serve as a foundation for further research in the biomedical domain.

Extensive comparison versus the state of the art datasets
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

No major weaknesses. Some minor comments are suggested below (section 9).
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Code and data are not available, but comparisons were made to publicly available datasets.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Given that the images are selected from publications, please comment on the impact of figure quality on the performance of the pre-trained model.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed PMC-OA dataset with 1.6 M image-text pairs can serve as a foundation for further research in the biomedical domain.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper introduces PMC-CLIP, a pre-trained model for biomedical language-image contrastive pre-training. It also presents PMC-OA, a biomedical dataset with 1.6M image-caption pairs collected from PubMedCentral’s OpenAccess subset. The main claim of the paper is that it outperforms previous models on various downstream tasks, including image-text retrieval, image classification, and medical VQA.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The proposal of an automatic pipeline to construct high-quality image-text biomedical datasets from scientific papers, which can be continuously updated.
2. The creation of a large-scale biomedical dataset, PMC-OA, with 1.6M image-caption pairs collected from PubMedCentral’s OpenAccess subset, which is 8 times larger than before and covers diverse modalities or diseases.
3. The pre-training of a vision-language model on PMC-OA, termed as PMC-CLIP, to serve as a foundation model for biomedical domain.
4. The thorough evaluation of PMC-CLIP on various downstream tasks (retrieval, classification, and VQA), and demonstration of state-of-the-art performance.
5. The release of the dataset and pre-trained model to the community for further research and development in the biomedical domain.
6. The paper is well written.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Minor comments as the approach is trained using known techniques, i.e., a combination of standard image-text contrastive (ITC) loss and masked language modeling (MLM) to encourage joint interaction of image and text.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors will provide the necessary code in a repository and the dataset will be made public. Thus, the approach should be fully reproducible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The manuscript introduces a unique method for generating an extensive biomedical dataset and training a pre-existing model for biomedical language-image contrastive pre-training. This well-founded approach tackles some of the shortcomings observed in earlier studies in the field. The authors offer comprehensive explanations of both the dataset generation process and the training methodology for PMC-CLIP, facilitating the replication of their findings. The pretrained model is compared against other approaches and outperforms them in a variety of tasks including classification and retrieval. Ablation studies are extensive and highlight the importance of the pretraining task.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Well annotated and larger data sets in the medical domain are needed. This work could be fundamental for future developments.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The contribution of this paper is the construction and release of PMC-OA, a biomedical dataset that contains 1.6 million image-caption pairs collected from the OpenAccess subset of PubMedCentral. PMC-OA is 8 times larger than the previous dataset and covers diverse modalities and diseases. The majority of the image-caption samples in PMC-OA are aligned at a finer-grained level, i.e., subfigure and subcaption.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The main strengths of the paper are as follows:
1. The paper proposes a PMC-CLIP that achieves state-of-the-art performance on several benchmarks.
2. The paper provides a biomedical dataset with 1.6 million image-caption pairs collected from the OpenAccess subset of PubMedCentral.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The main weaknesses of the paper are as follows:
1. The author claims to be the first to integrate subfigures separation, subcaptions separation, and alignment into the data collection pipeline, but the advantages of these steps need to be clearly described.
2. The paper claims that there is no duplication with the ROCO dataset, but it is not clear how the author guarantees that PMC-OA is not duplicated with the ROCO dataset.
3. The author claims “The fairness on population ensures our dataset sightly suffers from patient characteristic bias, thus providing greater cross-center generalize ability.” Such a claim should be discussed.
4. The paper is not well-organized. It is confusing whether the paper proposes a dataset or a method. If it is a dataset, there should be enough space to describe the dataset’s advantages, the ablation experiments should cover more than the two losses, and the title is not appropriate. If it is a method, there is no significant difference between PMC-CLIP and existing contrastive multimodal learning methods, and the ablation experiments are superficial.
5. Table 3 is not mentioned in the manuscript. In addition, there is no explanation of the comparison methods. The citation/reference formats are not consistent, such as “Table 4” and “Tab. 5”.
6. The paper does not explain the methodology and experimental procedures in detail, making it difficult to assess the validity and reliability of the results. Moreover, the paper does not discuss the limitations of the research.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

More information and details about the experimental setup need to be provided for better reproducibility. Specifically, the paper has provided fewer experimental details for downstream tasks, such as VQA. In the supplementary material, the answer prompt and question prompt are not yet clear, further hindering the ability to replicate the experiment.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

There are some suggestions for improvement; see the weaknesses for additional details.
• The author used only ResNet50 as the visual encoder, which can be expanded to multiple versions.
• Please provide a more thorough literature review that covers the most relevant and recent works in the field. This will help to justify the significance and novelty of this work.
• In the methodology section, please provide more details and explanations about the experimental design and data collection procedures. This will help to assess the validity and reliability of the results. It is recommended to add a more comprehensive description and statistics of the dataset.
• In the discussion section, please discuss the limitations and implications of the research.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes PMC-CLIP, which achieves state-of-the-art performance on several benchmarks, together with a large dataset. It has some shortcomings in terms of providing sufficient details. Furthermore, the organization of the paper is a bit confusing.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

The authors’ rebuttal provides additional responses to my comments. I increase my score to 5.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper introduces a foundational model tailored for the medical domain, showcasing its applicability across various downstream tasks. An extensive evaluation of the model is conducted, encompassing multiple tasks. However, there are several concerns raised by reviewers that should be addressed for further improvement: (1) The author claims to be the first to integrate alignment into the data collection pipeline. However, the advantages and benefits of these steps need to be clearly elucidated in the paper. Providing a detailed explanation of how these integrations enhance the overall performance and effectiveness of the model would strengthen the claims made. (2) Reviewers have requested additional information and specifics regarding the experimental setup. In order to ensure better reproducibility, it is crucial to include prompt details, downstream task details, and a comprehensive description of the method employed. By providing these necessary details, researchers will have a clearer understanding of the experimental framework and will be able to replicate and validate the results. (3) It is recommended to better contextualize the paper with existing literature in the field. Discussing and referencing relevant prior work will provide a stronger foundation for the proposed model, highlighting its novelty and contributions to the existing body of knowledge.

Author Feedback

We thank all reviewers for their constructive comments and include the response below. All codes and models will be released for reproducibility. In summary, all reviewers recognize the value of our proposed dataset and model, namely, PMC-OA and PMC-CLIP, “The proposed …… can serve as a foundation for further research in the biomedical domain”, “Well annotated and larger data sets in the medical domain are needed……”, “…… achieves state-of-the-art performance on several benchmarks, with a large dataset.” The concerns arise from R#3 mainly about reproducibility, data collection pipeline, literature review and other aspects.

Q: Benefits of subfigure separation, subcaption separation and alignment A: For unprocessed compound figures, the image-text pairs are only coarsely aligned and therefore ambiguous. For example, two subfigures with opposite results may be put together, and so are their descriptions. A fine-grained alignment should thus boost the performance of pretraining. In addition, subfigure separation can also operate as an ‘augmentation’ of the compound figures and captions. We conduct an ablation study to validate this claim:

| Method | I2T(R@10) | T2I(R@10) |
| compound figure - full caption | 48.95 | 44.54 |
| subfigure - full caption | 70.16 | 66.35 |
| PMC-CLIP | 71.88 | 69.69 |
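For readers unfamiliar with the R@10 columns, the following is a minimal sketch of how Recall@K is commonly computed for image-to-text (I2T) and text-to-image (T2I) retrieval. It assumes that the correct caption for image i sits at index i; the function name `recall_at_k` and the toy similarity values are illustrative and are not taken from the PMC-CLIP code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j; the
    ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidates by descending score and keep the k best indices.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy image-to-text similarity matrix: 3 images x 3 captions (made-up values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
i2t_r1 = recall_at_k(sim, 1)
i2t_r2 = recall_at_k(sim, 2)
# Text-to-image retrieval scores the transposed matrix (captions as queries).
t2i_r1 = recall_at_k([list(col) for col in zip(*sim)], 1)
```

In this toy matrix only the first image retrieves its own caption at rank 1, while every image finds its caption within the top 2, which is why Recall@K grows with K.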

Q: Experiment setup A: We will add more experiment details in the revision. Here, we briefly explain the process for deploying PMC-CLIP to VQA (Fig.4). Specifically, we freeze the pretrained visual and textual backbones, and only train a fusion module (2-layer transformer encoder) with learnable prompts. Given a triplet of image, question and answer (I, Q, A), image/text features (v_i, v_q, v_a) are extracted respectively. We concatenate learnable tokens (token_q, token_a) with the question/answer embeddings: ([v_q, token_q], [v_a, token_a]). The fusion module takes (v_i, [v_q, token_q]) as input and gives the answer prediction v_p. We then choose the answer that yields the highest similarity score with v_p.
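The final selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the rebuttal does not name the similarity measure, so cosine similarity is assumed here, and the helper names (`cosine`, `pick_answer`) and the 3-dimensional embeddings are hypothetical. The frozen encoders and the fusion module are not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_answer(v_p, answer_embeddings):
    """Return the candidate answer whose embedding is most similar to v_p."""
    return max(answer_embeddings, key=lambda name: cosine(v_p, answer_embeddings[name]))


# Hypothetical embeddings: v_p stands in for the fusion module's prediction,
# and each candidate stands in for an encoded answer [v_a, token_a].
v_p = [0.9, 0.1, 0.2]
candidates = {
    "yes": [1.0, 0.0, 0.1],
    "no": [0.0, 1.0, 0.0],
    "unclear": [0.1, 0.2, 1.0],
}
best = pick_answer(v_p, candidates)
```

Framing VQA as matching against a fixed set of candidate answers keeps the trainable part small, which is consistent with the rebuttal's statement that only the fusion module and prompts are trained.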

Q: Literature review A: As discussed in Sec.1, biomedical domain vision-language pretraining (VLP) lags far behind that in the general domain, mainly because of the scarcity of datasets. Existing biomedical image-text datasets have apparent drawbacks such as limited scale[ref.27, ref.31], covering only single[1, 2] or few modalities[ref.27], or being in the form of compound figures[ref.31]. This has consequently hindered the performance of models pretrained on them[ref.3, 3, 4, 5, 6]. Our research on PMC-OA and PMC-CLIP thus aims to lay a foundation for future study in this field, as also mentioned by R#1 and R#2. In the biomedical domain, we recognize a lot of progress in VLP methodology. For example, [5] combines global and local alignment, [6] decouples images and texts in contrastive learning, and [7] enhances the model with domain knowledge. These methods are orthogonal to our contributed PMC-OA dataset.
[1] CheXpert, AAAI 2019
[2] MIMIC-CXR, Scientific Data 2019
[3] ConVIRT, MLHC 2022
[4] PubMedCLIP, EACL 2023
[5] GLoRIA, ICCV 2021
[6] MedCLIP, EMNLP 2022
[7] MedKLIP, arXiv 2023

Q: Limitations A: Our work is a pilot study on biomedical foundation models and certainly has limitations. First, since academic literature is generally expected to explore unknown fields, atypical symptoms and rare diseases might be mentioned more often than they occur in clinical practice. Second, the images we collect are in 2D. PMC-CLIP’s generalization ability to 3D data is left as future work.

Q: Ablation on Visual backbone A: We switch RN50 to RN101, RN50x4 and ViT-B/32, following the settings of CLIP. Evaluation on ROCO shows that all ResNet variants perform close to RN50 and outperform ViT-B/32, potentially due to the large patch size of ViT-B/32.

| Visual Encoder | I2T(R@10) | T2I(R@10) |
| RN50 | 71.88 | 69.69 |
| RN101 | 71.82 | 71.14 |
| RN50x4 | 71.77 | 71.96 |
| ViT-B/32 | 64.42 | 64.77 |

Q: Deduplicate with ROCO A: We identify the images by using paperID and img link in ROCO.
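One way such a check could look is sketched below. The rebuttal gives no implementation details, so this is an assumption-laden illustration: the record fields `paper_id` and `img_link`, the function `remove_roco_overlap`, and the toy identifiers are all hypothetical, and matching on the (paper ID, image link) pair is only one possible reading of the answer.

```python
def remove_roco_overlap(pmc_records, roco_records):
    """Drop PMC-OA records whose (paper ID, image link) pair appears in ROCO."""
    # Build a set of keys once so each membership test is O(1).
    roco_keys = {(r["paper_id"], r["img_link"]) for r in roco_records}
    return [r for r in pmc_records if (r["paper_id"], r["img_link"]) not in roco_keys]


# Toy records with made-up identifiers.
pmc = [
    {"paper_id": "PMC0000001", "img_link": "fig1.jpg"},
    {"paper_id": "PMC0000001", "img_link": "fig2.jpg"},
    {"paper_id": "PMC0000002", "img_link": "fig1.jpg"},
]
roco = [{"paper_id": "PMC0000001", "img_link": "fig2.jpg"}]
kept = remove_roco_overlap(pmc, roco)
```

Only the record that shares both the paper ID and the image link with a ROCO entry is removed; figures from the same paper with a different link are kept.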

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper presents a medical domain foundation model, demonstrating its effectiveness in various downstream tasks. While the reviewers have recognized the strengths of the work, they have also identified some areas for improvement. However, the authors have effectively addressed concerns related to experimental setup details, related works, and limitations. Considering the paper’s intriguing nature and its contributions, I recommend accepting the work.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Based on the strengths and weaknesses identified, it is evident that the paper makes valuable contributions to the biomedical domain by introducing new datasets and showcasing the applicability of the proposed model. While there are concerns raised by reviewers, the authors have addressed these concerns in the rebuttal and have committed to providing additional details for better reproducibility. The positive feedback from the reviewers and the recognition of the paper’s contributions justify accepting the paper with minor revisions. The authors should incorporate the necessary improvements, including providing detailed explanations of the benefits of alignment, additional experiment details, and a stronger contextualization with existing literature.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This manuscript introduces PMC-CLIP, a pre-trained model designed specifically for biomedical language-image contrastive pre-training. Moreover, it presents PMC-OA, a comprehensive biomedical dataset comprising 1.6 million image-caption pairs, curated from the OpenAccess subset of PubMedCentral. In their rebuttal, the authors effectively address queries regarding the benefits of subfigure and subcaption separation and alignment, along with clarifying the experimental setup details, and refining the literature reviews. I am of the opinion that this work represents a valuable contribution to the field and could serve as a foundational basis for future research within the biomedical domain. As such, I advocate for the acceptance of this paper. Upon acceptance, I encourage the authors to incorporate the discussions presented in the rebuttal into the final manuscript.


PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents