Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Xiaofei Chen, Yuting He, Cheng Xue, Rongjun Ge, Shuo Li, Guanyu Yang

Abstract

The foundation models based on pre-training technology have significantly advanced artificial intelligence from theoretical to practical applications. These models have made computer-aided diagnosis feasible for widespread use. Medical contrastive vision-language pre-training, which does not require human annotations, is an effective approach for guiding representation learning using the description information in diagnostic reports. However, the effectiveness of pre-training is limited by the large-scale semantic overlap and shifting problems in the medical field. To address these issues, we propose the Knowledge-Boosting Contrastive Vision-Language Pre-training framework (KoBo), which integrates clinical knowledge into the learning of vision-language semantic consistency. The framework uses an unbiased, open-set sample-wise knowledge representation to measure negative sample noise and supplement the correspondence between vision-language mutual information and clinical knowledge. Extensive experiments validate the effect of our framework on eight tasks including classification, segmentation, retrieval, and semantic relatedness, achieving comparable or better performance under zero-shot or few-shot settings. Our code is available at https://github.com/ChenXiaoFei-CS/KoBo.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_39

SharedIt: https://rdcu.be/dnwcQ

Link to the code repository

https://github.com/ChenXiaoFei-CS/KoBo

Link to the dataset(s)

[1] CheXpert: CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison: stanfordmlgroup.github.io

[2] MIMIC-CXR: MIMIC-CXR-JPG - chest radiographs with structured labels v2.0.0: physionet.org

[3] COVIDx CXR-2: COVIDx CXR-2, Kaggle

[4] SIIM-ACR: SIIM-ACR Pneumothorax Segmentation, Kaggle

[5] UMNSRS: Semantic Relatedness and Similarity Reference Standards for Medical Terms: umn.edu


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a novel knowledge-boosting medical contrastive vision-language pre-training framework to tackle the semantic overlap and semantic shifting problems. Extensive experiments on several downstream tasks validate the effectiveness of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Unlike previous pre-training frameworks, the authors enrich the medical semantic information of the samples using domain knowledge, which suppresses the misclassification of false-negative samples, unifies the representation of textual semantics, and enhances the semantic interaction between the visual and textual modalities, thereby addressing the semantic overlap and semantic shifting problems.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Figure 3 has poor readability, and the arrangement of the arrows inside the figure is quite messy.
    2. At the end of section 2.1, instead of explaining important symbols inside the tuple, a reference is directly cited, which is not clear enough for readers.
    3. In the second paragraph of section 2.2, the No Finding embedding and a variant of domain knowledge embedding are randomly initialized. This design seems not very reasonable and it is difficult to ensure that these codes can express the opposite information of these semantics, i.e. normal information.
    4. The symbol T in equation 2 represents the transpose operation, but it looks similar to other symbols with T (such as IT and Text), which may cause ambiguity.
    5. There is skepticism about the claim made in section 2.3 that only using domain knowledge embedding to create a simple attention mechanism can represent unbiased anchors in semantic space.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In the second paragraph of section 2.2, the No Finding embedding and a variant of domain knowledge embedding are randomly initialized. This design is not very reasonable and it is difficult to ensure that these codes can express the opposite information of these semantics, i.e. normal information. Therefore, it is recommended to use more effective initialization methods, such as clustering the features of normal samples and selecting the cluster centers to initialize the normal features.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel visual-textual representation learning method that differs from previous research by incorporating domain knowledge to enhance the interaction of vital information between visual and textual modalities.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Although this work has some shortcomings, I believe that it is a solid work and the feedback addresses most of the issues. Therefore, I maintain the original rating.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a Contrastive Vision-Language Pre-training (VLP) framework that incorporates knowledge graphs to address semantic overlap and semantic shifting issues, which could arise when using only images and text modalities in CXR studies. The proposed model is trained on the MIMIC-CXR dataset and evaluated on eight downstream tasks. Results indicate that the proposed model outperforms baseline VLP methods, demonstrating its effectiveness in enhancing performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -The paper highlights two domain-specific problems in VLP for medical images, which is a valuable contribution.

    -The introduction of knowledge graphs to address these problems is a novel and interesting approach.

    -The experiments are well-designed and the results are strong.

    -Overall, the paper is well-structured and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    -The lack of details regarding the constructed knowledge graph is a potential limitation, although this may be due to page limitations.

    -The inclusion of confidence intervals would strengthen the results and make them more convincing.

    -The comparison in Fig. 4(b) to the ImageNet baseline is expected, but it would be more informative to also compare the proposed model to other VLP models, such as ConVIRT and GLoRIA.

    Minor issues:

    -Section 2.1 should use “distinct projectors” instead of “district.”

    -The meaning of N_e is not defined in section 2.1 and should be clarified.

    -Fig. 3(a) has too many crossed lines, making it difficult to read.

    -The CAM in Fig. 5 is diffused and lacks focus. Additionally, the text stating that “right pneumothorax is resolved” suggests that there should not be attention paid to the right lung, since it is “resolved”.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is credible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    To strengthen the evaluation, providing confidence intervals to the main results would be beneficial. Additionally, the paper could provide more detailed information about the knowledge graph used in the proposed method. It may also be worthwhile to conduct an ablation study on the density of the graph and its impact on pre-training performance. For example, randomly removing a certain percentage of the links and observing the effect on performance would be a useful analysis to include.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s incorporation of a knowledge graph into a VLP framework is an interesting approach with meaningful and well-explained motivations. However, to increase the credibility of the proposed method, the paper could benefit from additional details about the knowledge graph itself, including its construction and the specific techniques used to incorporate it into the VLP framework. Additionally, a more comprehensive analysis of the impact of the knowledge graph on both pre-training and downstream tasks would be valuable.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes the KoBo framework, which integrates clinical knowledge into pre-training models for medical diagnosis. The authors have conducted extensive experiments on eight tasks and achieved comparable or better performance than existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The innovative approach of integrating clinical knowledge into pre-training models.
    2. The experiments are comprehensive and demonstrate the effectiveness of the proposed framework.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The designed framework is complicated. The paper lacks clarity in explaining the technical details of the proposed framework, which makes it difficult for readers to understand the methodology.
      • There are many notations that hinder the reader from understanding the framework accurately and efficiently.
      • What is the triplet G?
      • typo: “District projectors” should be “Distinct projectors”
    2. The motivation of the proposed framework should be clarified. This paper introduces many modules and losses; therefore, it is important to give the insights and motivation behind each module.

    3. The novelty of the proposed approach is limited.
      • The idea is similar to the existing work [1]. What are the main differences between this work and [1]?
    4. How to prove the proposed approach can address the claimed problems, i.e., Semantic Overlap Problem and Semantic Shifting Problem? Could you provide evidence, e.g., human evaluation, to prove it?

    5. The authors have not compared their results with state-of-the-art methods [2][3][4], which may weaken the significance of their contributions.

    References: [1] MedCLIP: Contrastive Learning from Unpaired Medical Images and Text. EMNLP 2022. [2] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. ECCV, 2022. [3] Advancing Radiograph Representation Learning with Masked Record Modeling. ICLR, 2023. [4] Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good. The key hyper-parameters are introduced. However, the technical details should be clarified.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    Please see the weaknesses listed above; the same comments and references apply.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper presents an innovative approach to pre-training models and provides comprehensive experimental results. However, the authors need to address the aforementioned weaknesses to strengthen the paper.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper proposes a medical contrastive vision-language pre-training (VLP) framework, named KoBo, with the goal of incorporating additional knowledge to maximize semantic consistency between paired image and text features. The proposed method consists of two modules: (1) The KSE module measures the negative sample noise by calculating the sample-wise similarity between estimated knowledge embedding; and, (2) The KSG module adjusts the semantic shifting during pre-training by fusing domain-sample knowledge with global-local modality embeddings. To do so, the KSG module leverages four sub-modules, namely Knowledge Anchor Guidance, Semantic Knowledge Refinement, Vision Semantic Response, and Semantic Bridge Guidance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-organized.

    The results show improvement for semantic relatedness and segmentation in a few-shot setting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The paper’s novelty is limited, and its original contributions require further investigation. While the paper focuses on enhancing vision-language pre-training using knowledge, it overlooks previous studies [1] that proposed using domain expert knowledge from UMLS to enhance such models.

    (2) The proposed method combines five losses, resulting in increased complexity and dependence on several hyperparameters, raising concerns about optimization difficulty and the generalizability of the method to other applications.

    (3) The method is not well-described, and crucial details are missing. It is unclear how KSE and KSG are integrated; details of constructing the knowledge graph with UMLS, pre-training the graph encoder, and extracting concept sets with NegBio are missing, raising concerns about the reproducibility of the method.

    (4) The experiment setups are not well-established, and the results may be insufficient to demonstrate the efficacy of the proposed method.

    (4-1) Despite the paper’s claim of state-of-the-art performance, there is a lack of comparison with SOTA methods such as BioViL [2] and the most related work [1].

    (4-2) Despite the increased complexity of using extra knowledge, KoBo’s performance (with a ResNet-50 backbone, as in the baselines) on both retrieval tasks is worse than the baselines, and the gains on the remaining tasks (e.g., UMNSRS and CLS(V) on CheXpert) are marginal. Moreover, there is no statistical analysis (i.e., p-values) to demonstrate the significance of the gains.

    (4-3) The experiments do not adhere to common and standard evaluation protocols, such as those in ConVIRT, GLoRIA, and BioViL. (a) The evaluations on vision tasks are limited to a frozen encoder with 1% of the data, while results for other portions and full training data, which could demonstrate the data efficiency of the proposed method, are missing. (b) Fine-tuning results for the different tasks are missing. (c) Performance on image-text retrieval under different precision thresholds is missing.

    (4-4) While the paper aims to improve semantic alignment between image and text, it is not evaluated on phrase grounding, which is a standard evaluation in the vision-language paradigm [2].

    (5) The ablation experiments are not comprehensive and have certain flaws.

    (5-1) The ablation on KSE and KSG is limited to one task. Moreover, the performance gains offered by KSG appear marginal, and there is no p-value to verify their true impact.

    (5-2) There is no ablation on the four sub-modules of KSG.

    (5-3) In Figs. 4 and 5, KoBo is compared only with ImageNet, which is not a competitive baseline, while a comparison with SOTA medical vision-language models is lacking.

    (5-4) The t-SNE plot is limited to a single disease; the remaining four diseases in CheXpert are missing.

    [1]Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge, 2022 [2]Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing, 2022

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is not clear how many times each method has been run. The standard deviation and statistical analysis are not reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    -The authors may consider a more appropriate visualization for the methodology figure as its current format is confusing and hard to understand.

    -It is not clear why the performance of CLIP on SIIM in Table 1 is not reported.

    -The authors developed a new semantic relatedness benchmark generated from MIMIC-CXR, but the details of this benchmark and its ability to demonstrate the effectiveness of the proposed method in semantic relatedness are unclear.

    -Fig. 5 may not demonstrate the efficacy of the proposed method in learning fine-grained and effective image features, and it is only limited to one image. It is recommended to conduct quantitative analysis on phrase grounding tasks.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the paper and its original contributions are not clear. Furthermore, the proposed method exhibits additional complexity while either underperforming compared to the baselines or leading to only marginal performance gains. The experiments conducted are not well-established, and the ablation experiments are not comprehensive enough. For more information, please refer to the weaknesses and detailed comments sections.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The rebuttal falls short of fully addressing the main concerns: the arbitrary experimental setup, the complexity of the proposed method (which may lead to optimization difficulty and negatively impact its generalizability), the inferior performance in retrieval tasks and marginal gains in some other tasks, and the lack of ablation experiments to demonstrate the impact of each module.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel knowledge-boosting medical contrastive vision-language pre-training framework that aims to address the semantic overlap problem and semantic shifting problem in medical diagnosis. The authors introduce domain knowledge into the pre-training process to enrich the medical semantic information of the samples, which helps suppress misclassification of false-negative samples, unify the representation of textual semantics, and enhance the semantic interaction between visual and textual modalities. The paper provides extensive experiments on several downstream tasks, demonstrating the effectiveness of the proposed method.

    The main strengths of the paper include the incorporation of domain knowledge to enhance the interaction between visual and textual modalities, which is a valuable contribution. The paper also highlights the problems specific to VLP for medical images and proposes a novel approach to address them. The experiments are well-designed and demonstrate the superiority of the proposed method over baseline VLP methods. Additionally, the paper is well-structured and easy to follow.

    However, there are several weaknesses that should be addressed, including improving the clarity of the introduced notation and the readability of figures, using more appropriate initialization methods, and providing evidence or a more detailed explanation for the claimed benefits of the proposed approach, as well as statistical analysis. Additionally, the authors should consider further motivating the proposed framework and comparing their results with state-of-the-art methods in the field to strengthen the significance of their contributions.




Author Feedback

We sincerely thank all ACs and reviewers for their positive and constructive feedback.

Great Novelty (R2-“novel and interesting”, R3-“innovative”) Our KoBo opens up a novel paradigm of knowledge-driven foundation model pre-training (Fig. 1), innovatively handling two semantically profound challenges in the biomedical domain with the boost of an external knowledge base.

Q1: Novelty vs MedCLIP[EMNLP2022] (R3) Our KoBo can generalize to diseases covered by domain knowledge (10,244 concepts), while MedCLIP cannot learn beyond the sentence labels of 14 fixed diseases.

Q2: Novelty vs Med-VLP[ACMMM2022] (R4) Our KoBo copes with the semantic shifting problem in medical reports through our sample knowledge embedding, while Med-VLP cannot handle this problem with a simple GCN embedding.

Valuable Motivation (MR-“specific to VLP”, R2-“valuable”) This paper directly targets semantic overlap and semantic shifting during VLP, with a unified design of knowledge modeling.

Q3: Complexity and motivation of five losses (R3,R4)

  • The functions of our five losses complement each other: KSE reduces semantic overlap, KAG reduces disperse shifting, SKR reduces converging shifting, VSR consolidates knowledge guidance, and SBG reduces the modality gap (Secs. 2.2-2.3). They cooperate to learn vision-language semantic consistency.
  • Our comprehensive design targets different aspects of semantic overlap and shifting.

Q4: Evidence that the proposed approach addresses the claimed problems (MR,R3)

  • For semantic overlap, the overlap rate among negative samples is the evidence (20% is reduced to 6.7% with KSE for pleural effusion). The rate is defined as $\mathrm{rate}_i = \sum_{j:\,L_j = L_i} w_{i,j} / N_{\mathrm{Neg}}$, where $j$ indexes the negative samples, $i$ is the positive (anchor) sample, and $L$ is the disease label; the weight $w_{i,j}$ equals $\lambda$ under KSE and 1.0 in the common case.
  • For semantic shifting, syn-sim and neg-sim are the evidence (0.98 and 0.99 with text embeddings vs. 0.60 and -0.05 with knowledge embeddings for atelectasis). Syn-sim is the similarity between near-synonymous concepts (e.g., lung collapse), and neg-sim is the similarity between negated concepts (e.g., no atelectasis). This verifies that the knowledge embedding can recognize the sources of semantic shifting.
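The overlap rate above can be sketched in a few lines of NumPy. This is an illustrative reading of the rebuttal's formula, not code from the paper: the function name, batch layout, and default weight of 1.0 per negative are assumptions; under KSE the weights would be the learned $\lambda$ values.

```python
import numpy as np

def overlap_rate(labels, i, weights=None):
    """Overlap rate for anchor sample i within a contrastive batch.

    labels  : disease label for each sample in the batch
    i       : index of the positive (anchor) sample
    weights : optional per-sample weights (the lambda of KSE);
              defaults to 1.0 per negative, i.e. plain contrastive loss
    """
    labels = np.asarray(labels)
    negatives = np.arange(len(labels)) != i          # every other sample is a negative
    n_neg = negatives.sum()
    if weights is None:
        weights = np.ones(len(labels))
    # false negatives: negatives that share the anchor's disease label
    same_label = (labels == labels[i]) & negatives
    return float(np.sum(np.asarray(weights)[same_label]) / n_neg)
```

With plain contrastive weighting, two of four negatives sharing the anchor's label give a rate of 0.5; down-weighting those false negatives (as KSE's noise measurement would) shrinks the rate accordingly.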

Clear Writing (MR, R2-“well-structure”, R1-“good clarity”, R4-“well-organized”) The design of the two modules is closely tied to the motivations, with a complete logical chain and consistent notation.

Q5: Initialization methods (MR,R1) Random initialization of the negative knowledge embedding ensures the opposite semantics.

  • Our random values are drawn from a fixed distribution with small, stable variance (Sec. 2.2), while the learned features with positive semantics follow a different, clustered distribution after graph pre-training.

Well-designed Experiments (MR-“easy to follow”, R2-“well-designed”, R3-“comprehensive”, R4-“show improvement”) Our method is evaluated on eight tasks over five datasets under zero-shot or few-shot frozen settings, with data-amount analysis, module ablations, and visualizations.

Q6: Recommended comparisons (MR,R2,R3,R4) Thank you for pointing us to BioViL[ECCV2022], MRM[ICLR2023], REFERS[NMI2022], and Med-VLP[ACMMM2022]. Selected results: CheXpert-Zeroshot: BioViL(0.831), KoBo(0.859); CheXpert-Frozen(1%): MRM(0.859), Med-VLP(0.851), REFERS(0.855), KoBo(0.866) …

  • Med-VLP and REFERS are transformer-based works that learn a joint embedding, so they cannot be compared under the zero-shot setting.

Q7: Statistical analysis (MR,R2) We carefully calculated the confidence interval radii. Selected results of the recommended analysis (%): CheXpert-Frozen(1%): KoBo(0.11), ConVIRT(0.46), GLoRIA(0.32), MGCA(0.18) …

Q8: Our comparison protocol (frozen) is more general in the era of foundation models (R4)

  • The frozen setting is more memory-efficient and faster when transferring.
  • The frozen setting preserves the effective features of the pre-trained encoder.

Thank you for your thorough reviews and comments.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper has potential but requires significant revisions and clarifications. The concerns raised by the reviewers regarding the arbitrary experimental setup, the complexity of the proposed method, the performance in retrieval tasks, and the lack of ablation experiments have not been satisfactorily addressed in the rebuttal. The mentioned weaknesses, if adequately addressed, have the potential to strengthen the clarity, impact, and overall quality of the work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper introduces a novel knowledge-boosting medical contrastive vision-language pre-training framework, aiming to tackle the semantic overlap and semantic shifting issues in medical diagnosis. The incorporation of domain knowledge to enhance the interaction between visual and textual modalities is a notable strength and valuable contribution of this work. The paper also identifies and addresses specific challenges in applying VLP to medical images, offering a novel approach to overcome these challenges. The experiments are well-designed and demonstrate the superior performance of the proposed method compared to baseline VLP methods. The paper’s structure is well-organized.

    However, there are several weaknesses that need to be addressed. Firstly, the clarity of the introduced notation should be improved to enhance readers’ understanding. Additionally, the readability of the figures could be enhanced to facilitate comprehension.

    The authors’ decision to compare their results with state-of-the-art methods in the field in their response is appreciated, as it strengthens the significance of their contributions.

    Overall, the paper makes valuable contributions in the field of medical contrastive vision-language pre-training. Addressing the mentioned weaknesses will further enhance the clarity and impact of the work.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal fails to adequately address the main concerns raised by reviewers. This includes the arbitrary experimental setup, the complexity of the proposed method affecting optimization and generalization, underwhelming performance in retrieval tasks, marginal improvements in other tasks, and the absence of ablation experiments to assess the impact of individual modules.


