
Authors

Che Liu, Sibo Cheng, Chen Chen, Mengyun Qiao, Weitong Zhang, Anand Shah, Wenjia Bai, Rossella Arcucci

Abstract

This paper presents a novel approach to addressing the problem of chaotic latent space in self-supervised medical vision-language processing (VLP). Our proposed method is called Medical vision-language pre-training with Frozen language models and Latent spAce Geometry optimization (M-FLAG), which leverages a frozen language model for training stability and efficiency and employs two losses (a novel vision uniformity loss and a vision-language alignment loss) to harmonize the latent space geometry during VLP. Our extensive experimental results on three diverse downstream tasks: supervised image classification, semantic segmentation, and object detection, across five public datasets, demonstrate that M-FLAG significantly outperforms existing medical VLP approaches with much lower model complexity (reducing the number of parameters by 78%). Notably, M-FLAG achieves outstanding performance on the segmentation task while using only 1% of the RSNA dataset, even outperforming ImageNet pre-trained models that have been fine-tuned using 100% of the data.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_61

SharedIt: https://rdcu.be/dnwdI

Link to the code repository

https://github.com/cheliu-computation/m-flag-miccai2023

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a VL method that creates an optimized latent space. This is done in a memory- and data-efficient way by keeping the language model frozen and only training the vision branch. With the composite alignment loss, this method is able to consistently reach top performance on multiple datasets and tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The paper is presented in a very clear way. The structure of the paper is intuitive, and enough details on the method and implementation are provided.
    • By showing results on multiple different datasets and tasks, the authors demonstrate the versatility of the method.
    • The motivation behind the design of the alignment loss is clear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The discussion of Table 5 and Fig. 2 is unclear. It is apparent from the table that unfreezing the first layer leads to a performance degradation, while unfreezing more layers then leads to a performance increase. This is a peculiar finding that should be discussed. Furthermore, the meaning of the PCA results shown in Fig. 2 is not explained. It seems that unfreezing the language model leads to a better-structured latent space? How can this be linked to the results in Table 2?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See weaknesses

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The general presentation, method novelty and results.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents M-FLAG, a novel method for pre-training and regularizing medical vision-language models. M-FLAG uses a frozen language model for stability and efficiency, and a new uniformity loss to harmonize the latent space geometry. The approach outperforms existing methods on medical image classification, segmentation, and object detection tasks across five public datasets while reducing the number of parameters by 78%. Notably, M-FLAG achieves exceptional segmentation performance using only 1% of the RSNA dataset, surpassing ImageNet pre-trained models fine-tuned with 100% of the data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Introducing a novel method (M-FLAG) for pre-training and regularizing medical vision-language models, which improves training stability and efficiency.
    • Incorporating a new uniformity loss to harmonize the latent space geometry, optimizing the model’s performance.
    • Demonstrating significant improvements over existing medical vision-language pre-training approaches in various downstream tasks, such as medical image classification, segmentation, and object detection.
    • Achieving substantial parameter reduction (78%) while maintaining or improving performance, making the model more efficient and computationally economical.
    • Showcasing M-FLAG’s performance on the segmentation task with only 1% of the RSNA dataset, even outperforming ImageNet pre-trained models fine-tuned with 100% of the data, indicating its effectiveness in low-data scenarios.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Previous studies have applied the idea of using independent feature maps in various paradigms; for example, “Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval” applies a diversity loss term to penalize the redundancy among K locally guided features. However, the combination of the two loss terms here can be considered a novel approach.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It can be reproduced using the provided information.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please cite papers that use the idea of diverse feature maps, such as the one mentioned above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the approach, along with the extensive experiments and compelling results, led me to accept this paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a Vision-Language Pre-training (VLP) model using an already pre-trained text encoder, which remains frozen during training. They evaluate the model extensively on five datasets and three tasks (classification, segmentation, and object detection).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Comparing vision and language latent space geometry is interesting with a lack of literature on the topic in the medical domain.
    • Experiments with frozen language model are thorough (across 5 datasets) and reproducible, which is highly appreciated.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The method requires a pre-trained and highly effective text encoder, which is, in this paper, domain-specific to radiology reports (CXR-BERT).
    • A comparison with [1,4] is missing: I would like to see how a simple CLIP-based approach (which is currently SOTA) performs compared to M-FLAG. Also, since the model builds on CXR-BERT, it is crucial to compare it to BioViL (the vision-language model introduced along with the CXR-BERT encoder).
    • The authors claim M-FLAG is computationally less costly than other methods, but they use an already domain-specific language model as the text encoder, so this computational cost should be explicitly mentioned.
    • The uniformity described in Eq. 3 is different from the original uniformity loss described in [3], which is misleading since we no longer have a clear definition of uniformity. This term is much closer to the Barlow Twins loss (yet distinct, since the authors consider the empirical correlation matrix and not the cross-correlation between views; see the sketch below). I would name it differently, and I would like to see a clear explanation before its definition.
    • “Differently, here we address this problem by employing a uniformity loss, which directly aligns the geometry of the latent space towards a uniform hypersphere to tackle the collapse problem.” → Related to the previous comment, this claim is not true, since the uniformity loss is different from [3].
    • Results with the unfrozen language model do not make sense and suggest over-fitting or a sub-optimal loss function for the text encoder. This contradicts recent work [1]. I strongly believe this discrepancy comes from the asymmetry in the proposed loss between the image and text modalities: the “uniformity” loss (Eq. 3) is defined only for image representations, but it should be defined symmetrically for text representations in these experiments.
    • For text encoding, do you use random sentences from the radiology reports or the impressions, as in [1]? It has been shown [1] that this implementation detail strongly impacts the representation.
    • How are the other baseline models in Tables 2 and 3 pre-trained? Are they pre-trained on the same dataset as M-FLAG? With what hyper-parameters?
    • Unfounded claim: “It has been suggested that optimal vision and language latent spaces should be of different geometry” → the reference given (a pre-print) only describes the latent space geometry of language models, without giving an explicit link to the vision latent space geometry. Overall, this question is interesting, but the answer remains unclear [2].

    [1] Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning, Tiu et al., Nature Biomedical Engineering 2022
    [2] Neural representational geometry underlies few-shot concept learning, Sorscher et al., PNAS 2022
    [3] Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere, Wang and Isola, ICML 2020
    [4] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing, Boecking et al., ECCV 2022
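
    To make the distinction in the uniformity-loss comment concrete, here is a minimal PyTorch sketch (added for illustration; the function names and the temperature t are assumptions, not taken from the paper or the reviews) contrasting the original uniformity loss of Wang and Isola [3] with a Barlow-Twins-style penalty on the empirical correlation matrix of a single view, which is the form Eq. 3 is said to resemble:

```python
import torch

def uniformity_wang_isola(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    # Uniformity loss from Wang & Isola [3]: log E[exp(-t * ||z_i - z_j||^2)]
    # over all pairs. z: (N, D) L2-normalized embeddings on the hypersphere.
    sq_dists = torch.pdist(z, p=2).pow(2)       # all pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()  # minimized when samples spread out

def correlation_penalty(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Barlow-Twins-style penalty on a single view: drive the empirical
    # correlation matrix of the features towards the identity, i.e.
    # decorrelate feature dimensions (no cross-view correlation involved).
    z = (z - z.mean(dim=0)) / (z.std(dim=0) + eps)  # standardize each dimension
    c = (z.T @ z) / z.shape[0]                      # (D, D) correlation matrix
    return ((c - torch.eye(c.shape[0], device=c.device)) ** 2).sum()
```

    Under this reading, the second term enforces decorrelated (near-orthogonal) feature dimensions rather than uniformity of samples on the hypersphere, which supports the suggestion to rename it.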
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Datasets and splits used are widely used in the literature and should be easily reproducible. Hyper-parameters are clearly defined. Nevertheless, the code is not provided and I believe it should be added to improve the quality of this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    While the results are promising, the method seems incremental compared to previous work [1]. I would like to see how this method performs compared to BioViL [1]; otherwise, it seems the performance is mainly driven by the pre-trained text encoder. Besides, the experiments with unfrozen language models are not well designed and yield curious results (see weaknesses). [1] Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning, Tiu et al., Nature Biomedical Engineering 2022

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See my comments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The comparison with [1] is still missing, while it is crucial for checking whether the performance improvement comes from the proposed loss or only from the pre-trained text encoder in [1], which would lower this work’s novelty. The argument invoked (“different image size”) should be easy to fix in order to make the comparison.

    Regarding the uniformity loss, the authors suggest that it should impose a Gaussian distribution on the visual features, a claim that should be mathematically demonstrated since I do not find this assertion obvious. Additionally, it contradicts the first version of the manuscript: “we address this problem by employing a uniformity loss, which directly aligns the geometry of the latent space towards a uniform hypersphere to tackle the collapse problem.”

    The drop in performance when fine-tuning the text encoder is curious, and it remains unresolved after reading the authors’ response. It mostly indicates that the proposed loss is not well designed for training both vision and language models, and that the initial text encoder’s quality is crucial to guarantee good performance.

    Overall, the current results prevent me from accepting the paper as is. There are too many unclear/invalid points that would need to be clarified in a future version, and this lowers the main contributions of the paper.

    [1] Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing, ECCV 2022




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a new medical image and medical report joint learning approach with a fixed text encoder. All reviewers appreciate the motivation (latent space geometry), the method novelty, and the good results. At the same time, R3 questions the critical difference between the proposed method and other related methods and also raises several points needing clarification. Therefore, the authors are encouraged to respond to R3’s comments via rebuttal and further improve the quality of this paper.

    Particularly, I notice that the frozen language model in the experiment is not a general language model but the specific CXR-BERT, which is a text encoder already pre-trained with medical images and medical reports, actually on the same MIMIC-CXR dataset (if I am correct). This is somehow inconsistent with the argument in the paper: “In this work, we use a frozen text encoder ET, which can be obtained from any general language model.” The authors should also discuss this point and better motivate this work and its benefits.




Author Feedback

We express our gratitude to the AC and all reviewers for their valuable feedback and for recognizing the motivation and novelty of the method as well as its good performance. As suggested by the AC, we first address R3’s main comments.

1) Difference from other methods [1,4] (R3): While [1] uses the original CLIP method and InfoNCE loss to align images and reports, our approach employs an MSE loss for image-report alignment and regulates the visual latent space via the uniformity loss.
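
As a rough illustration of this design, the following sketch shows how the two terms could combine during pre-training (the function and variable names and the loss weight lam are hypothetical; the exact form of the regularizer and any weighting are as defined in the paper’s Eq. 3):

```python
import torch
import torch.nn.functional as F
from typing import Callable

def vlp_loss(img_emb: torch.Tensor,
             txt_emb: torch.Tensor,
             latent_reg: Callable[[torch.Tensor], torch.Tensor],
             lam: float = 1.0) -> torch.Tensor:
    # img_emb: (N, D) embeddings from the trainable vision encoder + projector.
    # txt_emb: (N, D) embeddings from the frozen text encoder (constant targets).
    # latent_reg: the latent-space regularizer applied to the visual features,
    # e.g. the correlation_penalty sketched under Review #3.
    align = F.mse_loss(img_emb, txt_emb.detach())   # MSE image-report alignment
    return align + lam * latent_reg(img_emb)        # regularize visual space only
```

Since the text encoder is frozen, gradients flow only through the vision branch, which is what makes the scheme stable and cheap to train.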

M-FLAG utilises a frozen text encoder, CXR-BERT, from [4]. It offers a computationally efficient way for groups with limited GPU resources to conduct research on large medical VLP models, which is a main message of this paper to the community. It substantially reduces the number of parameters, by 4-fold (Tab. 2). Also, M-FLAG differs from [4] in pre-processing details (using 512x512 images vs 256x256, only using complete reports with an IMPRESSION section, introducing sentence shuffling within sections), which makes a direct comparison challenging.

2) Clarification of results of unfrozen variants (R3): [1] utilises a pre-trained CLIP model, where the text encoder is trained on generic text. This is perhaps why the model needs to be further trained using medical reports. On the contrary, our work uses a medical-specific text encoder, CXR-BERT, and we find that a frozen text encoder is a better choice in this case. This alleviates the need for regularising the text latent space; we only regularise the image latent space in Eq. (3) using the uniformity loss. Also, we hypothesise there is some asymmetry between the image latent space and the text latent space. [5, 6] demonstrate that text embeddings are inherently more meaningful when mapped into a hyperbolic space, which differs from the hyperspherical structure of the image latent space.

3) Discussion of uniformity loss (R3): The uniformity loss in M-FLAG is intended to regularise the visual latent space to a Gaussian distribution, whereas the uniformity loss in [3] is designed to distance negative samples from positive samples. They have different meanings. To avoid confusion, we will rephrase ‘uniformity’ as ‘orthogonality’ in the revised paper.

4) Discussion about the pre-trained text encoder (AC): The CXR-BERT model is available in two variants, a general version and a specific version. We employ the general version, which was pre-trained on the PubMed dataset and clinical notes from MIMIC-III and MIMIC-CXR. No medical images were used in pre-training. This pre-trained model is publicly accessible on Hugging Face. We will rephrase our statement: “we use a frozen text encoder E_T, which is pre-trained and publicly available”.

Finally, we address other minor comments.

5) Discussion of Tab. 5 and Fig. 2 (R1): While unfreezing 2 or 3 layers slightly boosts performance due to the increased number of trainable parameters, the trend is not monotonic, as performance decreases again when unfreezing 5 layers (Tab. 5). All these variants still underperform compared to our all-frozen-layer approach. Although unfreezing leads to a more structured latent space under PCA (Fig. 2), it does not ensure the expected Gaussian structure, which M-FLAG achieves better (see green box).

6) Discussion of the Polysemous Visual-Semantic Embedding paper [7] (R2): While that paper employs multiple embeddings to represent the semantic meanings of different instances, M-FLAG aligns vision and language using a single embedding and focuses on regularising the structure of the visual latent space. We will add this paper in the revision.

7) Discussion on text encoding (R3): Following MGCA, we concatenate the findings and impression sections of the report into a single input to the text encoder.

We will incorporate these changes in the revised paper.

[1] Tiu, Nature Biomedical Engineering 2022. [2] Sorscher, PNAS 2022. [3] Wang and Isola, ICML 2020. [4] Boecking, ECCV 2022. [5] Fu, arXiv:2206.01512. [6] Chen, Probing BERT in hyperbolic spaces, ICLR 2021. [7] Song, CVPR 2019.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes new ideas and views for medical vision-language pre-training, including latent space geometry. However, the rebuttal did not address R3’s comments on the unclear/invalid points, nor the requested discussion of and comparison with the CLIP-based baseline and BioViL. The authors are advised to improve this work according to the comments and submit it to a future venue.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    While R3 remains unconvinced by the authors’ rebuttal, I personally find the rebuttal to be valid and satisfactory in addressing the concerns raised. However, I suggest that the clarification regarding [1] and CXR-BERT be incorporated cautiously and thoughtfully in the revised version of the paper. After thoroughly evaluating the authors’ rebuttal and taking into account the recommendations of R1 and R2, I also recommend accepting the paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Considering the strengths and weaknesses identified, the paper presents a novel approach for joint learning of medical images and reports. The motivation and results are appreciated by most reviewers. However, some of the concerns raised by Reviewer 3, regarding the inconsistencies in the use of the frozen text encoder, the clarification of the uniformity loss, the lack of comparison with the related work [1], and the drop in performance during fine-tuning, have not been adequately addressed. Based on the current state of the paper and the concerns raised, it is not suitable for acceptance in its current form. The authors are encouraged to address the raised concerns, provide the necessary clarifications, and conduct a proper comparison with the relevant work to strengthen the paper for a potential future submission.


