
Authors

Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, Dequan Wang

Abstract

The recent surge of foundation models in computer vision and natural language processing opens up perspectives in utilizing multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Guiding data-efficient image diagnosis with biomedical text knowledge is therefore of substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, adapting foundation models towards pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines, especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification. Code is available at https://github.com/Yunkun-Zhang/CITE.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_27

SharedIt: https://rdcu.be/dnwG5

Link to the code repository

https://github.com/Yunkun-Zhang/CITE

Link to the dataset(s)

https://zenodo.org/record/6550925


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new approach to adapting a foundation model for enhanced pathological image classification by using biomedical text annotations of the images, named “Connect Image and Text Embeddings” (CITE). Experimental results show that CITE outperforms all baselines at all data scales, including fully supervised learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents a novel approach that injects biomedical text annotations of pathological images into a foundation model for improved pathological image classification. The method uses an image encoder, a text encoder, and a projection that connects the two modalities, classifying images by the cosine similarity between image and text embeddings.
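
    For context, a minimal sketch of the cosine-similarity classification described here (illustrative only; function and variable names are not taken from the paper):

```python
import torch
import torch.nn.functional as F

def classify_by_similarity(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Score each image against per-class text embeddings.

    image_emb: (B, D) projected image embeddings
    text_embs: (C, D) one text embedding per class
    Returns (B, C) cosine similarities, used directly as class logits.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return image_emb @ text_embs.t()

# Hypothetical usage: logits = classify_by_similarity(img, txt); preds = logits.argmax(dim=-1)
```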

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the experimental results are outstanding compared to all baselines, the paper does not provide any information on computational performance.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper uses the publicly available PatchGastricADC22 dataset. Source code information is not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper can be further improved by (1) testing the proposed method on more datasets; (2) presenting more information on how the proposed model is trained; (3) providing information on the computational performance of the proposed model; (4) referencing more prior work in the area of vision-language models; and (5) rectifying typos such as “1000 iterations” (should be “1,000 iterations”).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presented an important/cutting edge research topic of broad significance. There are some novel aspects and the quality of the science is competent with no major flaws. The experimental design and evaluation of the proposed method are satisfactory, and the conclusions are justified. The manuscript is well written and is easy to follow. The paper can be further improved by addressing the points listed in detailed and constructive comments above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors have satisfactorily addressed most of my concerns.



Review #2

  • Please describe the contribution of the paper

    The paper aims to improve pathological image classification by injecting medical domain knowledge from language models pre-trained on biomedical texts. A visual prompt tuning model is used to reinforce data-efficient pathological image classification. The work was benchmarked on the PatchGastric stomach tumor pathological image dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using medical knowledge for visual prompt tuning
    • Comparing performance with different visual encoders
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper is hard to read, though grammatically fine.
    • Work is a direct application of https://arxiv.org/pdf/2203.12119.pdf
    • Related work is shallow.
    • Statistical significance and standard deviation of results not reported.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used an open-source dataset and code; therefore, the results should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It is generally understood that model hyperparameters are optimized for a certain data type. Not modifying the pre-trained model parameters seems limiting, given that medical data characteristics are specific rather than generic.
    • The related work section does not cover recent work as thoroughly as it deserves. Fine-tuning model hyperparameters is essential to optimize performance and avoid overfitting or underfitting. Besides, data annotation does not necessarily need to be fully supervised; a number of recent contributions in the literature follow a semi-supervised strategy. Please revise and compare accordingly.
    • How does prompt tuning deal with the increased computational cost of training models with different prompts, and with generalization? Training times could be compared and the approach applied to an unseen dataset, or this should be mentioned as a limitation.
    • Please report the statistical significance of the results in Table 1; without standard deviations, it is hard to appreciate the results in Table 2.
    • Please define the ‘v’ subscript in Equation 1.
    • There are unclear statements in the text, e.g., ‘… can hardly meet the requirement of a large model capacity’, ‘we underscore the utilization of biomedical language models for …’, and ‘Despite the promise of general foundation models …’. Please rephrase and check for writing correctness throughout.
    • Avoid overusing ‘e.g.’ and be explicit in your statements; there is no need to italicize ‘e.g.’. Excessive use of Latin abbreviations disrupts the flow of the writing and affects readability.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good application, no theoretical or mathematical novelty.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a parameter-efficient fine-tuning method based on the CLIP framework for aligning text and visual encoders in histopathological image classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is early work on CLIP-based models for histopathological image classification; it demonstrates the potential of applying text reports in this area.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. This article falls short in addressing the key challenges in pathological image diagnosis, particularly how to classify high-resolution Whole Slide Images. The article appears to only transfer the CLIP model from natural image and natural language processing fields, which deviates from the practical needs of pathological image classification problems.

    2. The chosen classification task is limited to distinguishing between high and low differentiation, which does not require considering context information across the entire image; only local cell structures are enough. This allows the use of patch-wise prediction and global voting classification methods, but it may not extend to more general pathological image classification problems, particularly tasks that require information at different scales, at both the global and local levels.

    3. Recent methods such as CLAM, DSMIL, and TransMIL have not been included as baselines. The chosen baseline methods are far too weak; in particular, the CLAM paper (published in Nature Biomedical Engineering) discusses its weakly supervised performance.

    4. On page 6, are you sure you used 20% of the data for training and 80% for validation? Furthermore, to demonstrate the model’s performance, a training/validation/testing split or cross-validation is required, rather than a simple training/validation split.

    5. The projection layer described in Section 3.1 is not reflected in the figure.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    “The average runtime for each result, or estimated energy cost.”, “A description of the memory footprint.”, and “An analysis of situations in which the method failed.” can and should be reported, instead of “Not Applicable”. The authors provide an empty GitHub link for the source code repository; I understand this means it will be filled in if accepted, but I recall that providing such a link is not allowed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The topic of this paper would be better suited to another conference such as ICCV, whose deadline coincides with MICCAI’s, or NeurIPS later on. The authors might conduct more experiments on more tasks and submit a revised version to conferences that focus more on the novelty of AI models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the weaknesses in the problem setting, the lack of baseline models and missing related work (such as the state-of-the-art DSMIL, TransMIL, and CLAM models), and the issues with the data splitting and experimental procedures, I would give a strong reject and recommend that the authors address these weaknesses and re-submit to conferences such as NeurIPS or AAAI.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Based on the authors’ response, I would like to raise my rating for the paper and recommend acceptance.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper under review presents a compelling approach for enhancing pathological image classification through the injection of biomedical text annotations into a foundation model. Overall, the paper is well written, but there are several key points that require the authors’ attention and clarification in the rebuttal phase:

    • Computational cost analysis and prompt tuning: the authors need to provide a comprehensive computational cost analysis, addressing the implications of training models with different prompts. It is crucial to explain how prompt tuning was utilized to handle the increased computational cost and whether any computational optimization techniques were employed. Elaborating on these aspects would enhance the paper’s overall rigor and applicability.
    • Detailed model training description: The paper needs enhancement and in-depth details regarding the training process of the proposed model as detailed by reviewers. By providing these additional details, readers will gain a better understanding of the reproducibility and reliability of the results.
    • Strengthening literature review and credit assignment: The authors should consider enhancing the literature review section by including a more comprehensive survey of relevant previous work.
    • Baseline: Concerns regarding proper baseline strategy needs to be addressed.




Author Feedback

We appreciate all valuable comments from the AC and three reviewers (R1, R2, R4). The following is our response. All updates will be made in the final submission.

Model training details and contribution (R1, AC) We will add the details of our model structure and the training process. Our key contribution is to assess the transferability of foundation models with specific medical text guidance in pathological image diagnosis. Our training process deviates from vanilla VPT. First, an image patch from the WSI is embedded into a sequence of tokens. A number of tokens with learnable parameters, called the visual prompt, are then prepended to the sequence. The whole sequence is processed through the pre-trained vision transformer blocks and a projection layer to obtain the image feature. The key design is that we leverage text information to inject medical domain knowledge: we obtain class-specific text features by processing the class texts (well/moderately/poorly differentiated) through a pre-trained biomedical language model. The image prediction is computed as the cosine similarity between the image feature and the three text features (each representing one class). The predictions are then used to compute the cross-entropy loss against the slide-level ground-truth label (i.e., all patches from one WSI share the same label, in a weakly supervised manner). Finally, the visual prompts are updated along with the projection layer via backpropagation. During inference, the predictions of all patches from one WSI perform a soft vote for the whole-slide classification.
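
To make this description concrete, below is a hedged PyTorch sketch of the pipeline as we understand it from the text above. It assumes a frozen pre-trained vision transformer (split here into `patch_embed` and `vit_blocks` modules) and pre-computed class-text features from a frozen biomedical language model; all names, the pooling step, and the hyperparameters are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PromptedClassifier(nn.Module):
    """Illustrative sketch of visual prompt tuning guided by class-text features."""

    def __init__(self, patch_embed, vit_blocks, embed_dim, text_dim, prompt_len=1):
        super().__init__()
        self.patch_embed = patch_embed                # frozen patch/token embedding (nn.Module)
        self.vit_blocks = vit_blocks                  # frozen pre-trained transformer blocks (nn.Module)
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)  # learnable visual prompt
        self.proj = nn.Linear(embed_dim, text_dim)    # trainable projection into the text space

    def forward(self, patches, text_features):
        tokens = self.patch_embed(patches)                               # (B, N, D)
        prompt = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        tokens = torch.cat([prompt, tokens], dim=1)                      # prepend visual prompt tokens
        feats = self.vit_blocks(tokens).mean(dim=1)                      # simple pooling (the paper may use [CLS])
        img_emb = F.normalize(self.proj(feats), dim=-1)                  # (B, d_text)
        txt_emb = F.normalize(text_features, dim=-1)                     # (C, d_text), one per class text
        return img_emb @ txt_emb.t()                                     # cosine-similarity logits

# Weakly supervised training: every patch inherits its slide-level label, e.g.
#   loss = F.cross_entropy(model(patches, text_features), slide_labels)

def predict_slide(model, patch_batches, text_features):
    """Soft vote: average patch-level class probabilities over one whole slide."""
    probs = [F.softmax(model(p, text_features), dim=-1) for p in patch_batches]
    return torch.cat(probs).mean(dim=0).argmax().item()
```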

Computational cost (R1, R2, R4, and AC) Using the CLIP ViT-B/16 visual encoder and the BioLinkBERT-large textual encoder, the average runtime for 1,000 training iterations is 11 minutes on two NVIDIA GeForce RTX 2080 Ti GPUs. With a total mini-batch size of 128 and a visual prompt length of 1, it requires 11.6 GB of GPU memory. We did not employ computational optimization techniques. We experimented with visual prompt lengths (VPL) ranging from 1 to 50, which take 11.6 to 14.6 GB of memory with roughly similar training times, and found that the prompt length does not significantly influence accuracy (60.1-60.8%).

Literature review (R1, R2, AC) We will add related references on vision-language models: BEiT (W. Wang et al., 2022), LiT (X. Zhai et al., 2022), BLIP (J. Li et al., 2023), and VLMo (H. Bao et al., 2022). References on data-efficient fine-tuning, such as Medical Image Classification Using Deep Learning (W. Wang et al., 2020) and CLAM (M. Y. Lu et al., 2021), and on semi-supervised data annotation (Y. Vindas et al., 2022) will also appear in the final version.

Baseline strategy (R4, AC) We added CLAM to the comparison, using the same CLIP ViT-B/16 as the feature extractor. The results (mean±std), to be compared with Figure 3, are:

Slides per class |    1     |    2     |    4     |    8     |    16    |   All
CLAM             | 51.4±1.8 | 53.3±1.7 | 54.9±1.8 | 65.9±0.9 | 64.1±1.8 | 68.1±0.6
CITE (ours)      | 60.2±1.2 | 58.6±0.7 | 60.2±0.8 | 65.7±0.9 | 67.9±0.5 | 69.7±0.1

Our method performs remarkably better, especially when only one slide per class is used for training.

Why fix pre-trained models? (R2) Keeping the foundation model frozen remains applicable when its parameters are not open-sourced for tuning, and it suits devices with limited computational resources.
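
Continuing the illustrative sketch above (assuming `model = PromptedClassifier(...)`), only the prompt and projection parameters would receive gradients while the backbone stays frozen; the optimizer settings below are placeholders, not the authors' exact configuration.

```python
import torch

# Freeze the pre-trained backbone; train only the visual prompt and the projection layer.
for p in model.patch_embed.parameters():
    p.requires_grad = False
for p in model.vit_blocks.parameters():
    p.requires_grad = False

trainable = [model.prompt] + list(model.proj.parameters())
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # placeholder hyperparameters
```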

When does the method fail? (R4) Our approach relies on text guidance from a pre-trained language model, so it may fail when the required knowledge is not covered by that language model.

Statistical significance (R2) Due to limited space, we only show the average standard deviation of each row in Table 1: 1.12, 1.54, 0.50, 0.59. We will show the full statistics in the final submission.

Data split (R4) We follow the setting of few-shot learning (i.e., less training data and more testing data), so we are sure we used a 20% train + 80% validation split.

Reproducibility (R2, R4) We will open-source the code upon acceptance.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers have recognized the strengths of the work while also identifying certain shortcomings. However, the authors have effectively addressed concerns related to the baseline strategy, computational cost, previous methods, and other relevant aspects. Considering the high merit and interesting nature of the paper, I recommend accepting the work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree that foundation models and their adaptation for medical image analysis is highly relevant. The authors were able to address previous concerns in the rebuttal so that, now, all reviewers rate the paper positively.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Initial reviews were mostly positive. The rebuttal addressed the concerns of a reviewer who raised an initially low score to a higher one. This topic on foundation models is fresh and interesting and will be beneficial to the MICCAI community.


