Authors

Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Maode Lai, Jianzhong Shou, Yubo Fan, Yan Xu

Abstract

Large-scale visual-language pre-trained models (VLPMs) have proven their excellent performance in downstream object detection for natural scenes. However, zero-shot nuclei detection on H&E images via VLPMs remains underexplored. The large gap between medical images and the web-originated text-image pairs used for pre-training makes it a challenging task. In this paper, we attempt to explore the potential of the object-level VLPM, the Grounded Language-Image Pre-training (GLIP) model, for zero-shot nuclei detection. Concretely, an automatic prompt design pipeline is devised based on the association binding trait of VLPMs and the image-to-text VLPM BLIP, avoiding empirical manual prompt engineering. We further establish a self-training framework, using the automatically designed prompts to generate preliminary GLIP predictions as pseudo labels and refining the predicted boxes in an iterative manner. Our method achieves remarkable performance for label-free nuclei detection, surpassing other comparison methods. Foremost, our work demonstrates that VLPMs pre-trained on natural image-text pairs exhibit astonishing potential for downstream tasks in the medical field as well. Code will be released at https://github.com/wuyongjianCODE/VLPMNuD.
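As a reading aid, the two stages the abstract names (automatic prompt design via BLIP, then zero-shot GLIP detection whose outputs seed the self-training) can be sketched roughly as below. Every helper in this sketch (blip_caption, extract_attribute_words, glip_detect) is a hypothetical placeholder, not the authors' released implementation:

```python
# Hypothetical sketch of the zero-shot stage: blip_caption,
# extract_attribute_words, and glip_detect are placeholder functions
# standing in for BLIP captioning, attribute mining, and GLIP inference.

def design_prompt(images):
    """Automatic prompt design: mine attribute words from BLIP captions."""
    captions = [blip_caption(img) for img in images]   # image -> text via BLIP
    attributes = extract_attribute_words(captions)     # e.g. color / shape words
    return " ".join(attributes) + " nuclei"            # object-level GLIP prompt

def preliminary_pseudo_labels(images):
    """Zero-shot GLIP predictions with the designed prompt become pseudo labels."""
    prompt = design_prompt(images)
    return [glip_detect(img, prompt) for img in images]
```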

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_67

SharedIt: https://rdcu.be/dnwKq

Link to the code repository

https://github.com/wuyongjianCODE/VLPMNuD

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

The work leverages visual-language pre-trained models (VLPMs) for zero-shot nuclei detection. Since VLPMs are mostly trained on web content, applications to medical images need to adapt the prompts used to guide the model's output. The main contribution is a self-training paradigm that automates prompt generation by using an image-to-text VLPM instead of handcrafting prompts. In this way, the prompts are more effective because they use the same type of language the models were originally trained on while still describing the objects of interest in medical imaging. Moreover, the generation of appropriate attribute words for a domain adds interpretability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Simplicity: The idea is conceptually simple and opens a new line of work that leverages existing visual-language model approaches for use on medical images. Novelty: Self-training that adapts prompts from domain-specific language to the language the models handle best is an interesting avenue for research and applications in medical imaging. Relevance: The proposed approach is innovative without having to reinvent the tools it uses, which gives it great potential for wider adoption.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The results of the paper are nice and provide a good proof of concept, but it would have been great to see a second application to support the generality of the approach. The description of the experiments could have included more details on the box optimization process.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The schematic of the framework is good guidance for implementing a similar pipeline. However, the lack of detail on how the self-training is accomplished means that the tables in the paper may not be easy to reproduce without the accompanying code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Provide more details on the implementation. The writing can be improved by defining several of the acronyms; for instance, GLIP and CLIP are never introduced in the paper. This is also a problem in the abstract.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The simplicity and novelty of the idea, which leverages existing tools rather than creating new ones. This has potential for impact.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

This paper presents three contributions. First, it proposes a new framework, based on VLPMs, for detecting nuclei without using labels. Second, it utilizes GLIP instead of CLIP to generate visual representations that prioritize object-level learning and yield higher-quality language-aware visual features, leading to improved nuclei retrieval. Third, it establishes an automated prompt design process that takes advantage of VLPMs' association binding trait to avoid the need for complex manual prompt engineering.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is visually appealing. It is evident that a good amount of time and effort has been devoted to creating the figures; Fig. 1 provides an excellent overview of the framework. A substantial effort has been made to explain the method clearly, particularly the differences between GLIP, BLIP, and CLIP, and the rationale for focusing on GLIP. The method employed is relatively recent, and the concept of utilizing VLPMs to detect unlabeled medical objects in a zero-shot manner is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There is limited technical novelty, and few technical descriptions are available. For the comparison table, I wish more supervised methods were presented to really know whether the proposed unsupervised method outperforms the baselines.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Although the paper claims that the code will be available at a particular repository address, the code has not yet been shared in the supplementary material. As a result, there is a possibility that this claim may not be fulfilled. The dataset is publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Although the approach is interesting, the absence of baselines and alternative methods disconnects it from the existing literature. It remains unclear whether these techniques would outperform established methods. Also, releasing the code would further strengthen the paper's impact.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors' ablation study on how various prompts perform in a similar setting is interesting. However, it would be beneficial to understand how each prompting strategy performs on different datasets and whether the strategies transfer to similar datasets. The comparison against supervised baselines is limited, and it is uncertain whether the proposed automatic prompting truly outperforms supervised settings.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

This paper explores zero-shot nuclei detection via visual-language pre-trained models (VLPMs). Specifically, it designs an automatic prompt generation pipeline based on the VLPM GLIP and the image-to-text VLPM BLIP. To refine the coarse detection results, a self-training framework is introduced to improve the detections in an iterative manner. Experiments on the public MoNuSeg dataset show its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Applying a vision-language pre-trained model to zero-shot nuclei detection is an interesting line of work. The proposed method achieves state-of-the-art results on a public dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The gap between medical images and the web-originated text-image pairs used for pre-training is the main concern when applying VLPMs. However, the authors still use the VLPM network for medical prompt generation without much modification.
    2. The automatic prompt design is a key module in the proposed method. Within the module, the novelty is not clear.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The key modules (GLIP, BLIP, and YOLOX) follow default settings. It should be easy to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Compared to the related work VLDet, what challenges did the proposed method solve? What advantages does it have?
    2. Can the self-training module be applied to other unsupervised methods [23][11][14] to boost their performance? As shown in Table 1 of the supplementary material, it seems that the superior results largely benefit from self-training.
    3. The automatic prompt design could offer excellent interpretability. It would be better to include a visualization comparing outputs under different prompts to support this claim.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper asserts that its main contribution lies in the utilization of GLIP [13], which yields superior language-sensitive visual representations compared to CLIP. However, the paper does not introduce many novel designs based on GLIP.
    2. The automatic prompt design is the main focus of this paper. However, as highlighted in the supplementary Table 1, the superior performance of the model can be attributed to YOLOX [4] rather than the suggested automatic prompt design.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The effectiveness of the prompt generation module has been validated through supplementary experiments.



Review #5

  • Please describe the contribution of the paper

This paper explores the potential of large-scale visual-language pre-trained models (VLPMs) for zero-shot nuclei detection on H&E images. It proposes an automatic prompt design pipeline and a self-training framework, achieving remarkable performance and demonstrating the potential of VLPM pre-training for downstream tasks in the medical field.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. This paper is well-written and easy to follow.

    2. Exploring large-scale visual-language pre-trained models for medical image analysis is an interesting and valuable contribution to the field.

    3. The automatic prompt design and self-training boosting techniques employed in the paper are innovative and demonstrate the potential of these approaches to improve performance in medical image analysis tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

To further strengthen the paper, additional ablation experiments (with different prompts) and visualizations could be included to provide a more comprehensive evaluation of the proposed approach and help readers better understand the factors that contribute to its performance.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors claim that the code will be available, and there are some experimental details in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

See weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

See strengths.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

I have checked the authors' rebuttal, and I maintain my accept decision.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This manuscript presents an interesting method for zero-shot nuclei detection in histopathological images, using a large-scale visual-language pre-trained model (VLPM), i.e., GLIP [13]. It devises an automatic prompt design pipeline, mainly based on the text-to-image alignment of the BLIP [12] and GLIP models, eliminating non-trivial manual prompt engineering. The method also introduces a self-training module to refine nuclei predictions in an iterative manner. The experiments show that the proposed method outperforms other unsupervised learning methods for nuclei detection.

    However, there are some concerns or weaknesses raised by the reviewers as follows:

1. The description of the self-training module is not sufficient, e.g., how the self-training is accomplished (Reviewer #2).

    2. The technical novelty of the proposed framework is limited or not clear (Reviewers #3 and #4), because the key component, automatic prompt design, is mainly built on GLIP [13] and BLIP [12].

    3. Clarify if the superior performance of the proposed method is attributed to YOLOX [4] rather than the automatic prompt design (Reviewer #4).

    4. Explain the difference and the advantages of the proposed framework compared with other related work such as VLDet [14] (Reviewer #4).

    Please consider addressing the comments above in the rebuttal.




Author Feedback

Novelty (MR2 R3 R4) The novelty of our work lies in the establishment of an efficient system that explores the transfer of the rich semantic knowledge exhibited in VLPMs from natural scenes to the task of H&E nuclei detection. This task is non-trivial, as evidenced by the unsatisfactory results achieved by simple designs (the first two rows of Table 1 and the first row of supplementary Table 1). Our proposed pipeline maximizes the potential of VLPMs and outperforms the other unsupervised methods compared. Ablation studies and the new table mentioned below demonstrate the indispensability of both the automatic prompt design and the self-training. Additionally, our method identifies patterns for designing prompts for unseen medical imaging domains, facilitating the application of VLPMs to other medical imaging tasks and significantly reducing annotation efforts.

Clarification on the significance of the automatic prompt design using VLPMs for pseudo-label generation (MR3 R4 R5) YOLOX and self-training are not the essential reasons for the superior performance of our method. The true key is the utilization of semantic-information-rich VLPMs. To illustrate this point, we employed another commonly used unsupervised detection method, superpixels [1], to generate pseudo labels in a zero-shot manner for a fair comparison. These pseudo labels were then fed into the self-training framework based on the YOLOX detection architecture, keeping the settings consistent with our approach except for pseudo-label generation. The results, shown in the table below, reveal poor performance. This demonstrates that the high performance of our method (mAP: 0.416, AP50: 0.808, AP75: 0.382, AR: 0.502) lies in the effective utilization of the knowledge from VLPMs rather than in YOLOX or self-training.

Method|mAP|AP50|AP75|AR
SP|0.027|0.075|0.012|0.035
YOLOXs1|0.260|0.612|0.162|0.373
YOLOXs2|0.272|0.617|0.183|0.392
YOLOXs3|0.284|0.655|0.169|0.404
YOLOXs4|0.279|0.614|0.213|0.389

Additionally, by using our framework with DETR [2] instead of YOLOX, our method also achieves promising results that are comparable to its fully-supervised counterpart:

Method|mAP|AP50|AP75|AR
fully DETR|0.404|0.749|0.398|0.501
ours+DETR|0.388|0.731|0.376|0.487
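For reference, the superpixel (SP) baseline in the first table can be reproduced in spirit with SLIC from scikit-image, turning each superpixel into one pseudo-label box. This is a minimal runnable sketch; the parameter values are illustrative assumptions, not the exact settings used above:

```python
# Superpixel pseudo labels: SLIC segments converted to bounding boxes.
# n_segments and compactness are illustrative, not the authors' settings.
import numpy as np
from skimage.measure import regionprops
from skimage.segmentation import slic

def superpixel_pseudo_boxes(image, n_segments=400, compactness=10):
    """Return [x_min, y_min, x_max, y_max] boxes, one per superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=compactness,
                    start_label=1)
    boxes = []
    for region in regionprops(segments):
        y0, x0, y1, x1 = region.bbox        # skimage bbox is (row, col) ordered
        boxes.append([x0, y0, x1, y1])
    return boxes

# Example on a random RGB patch:
patch = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
print(len(superpixel_pseudo_boxes(patch)), "pseudo boxes")
```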

The difference and advantages compared to VLDet (MR4 R4) (1) VLDet uses CLIP, which is not as effective as GLIP in object-level representation learning. Another CLIP-based method, VL-PLM, also underperforms in this task (supplement table 3). (2) VLDet focuses on natural images and does not address the domain gap between natural domain and unseen domains including H&E images. In contrast, we present the automatic prompt design specifically for this purpose, effectively addressing the domain gap problem. (3) VLDet does not optimize for scenarios with dense and overlapping nuclei, which leads to poor performance in such tasks. We use self-training for further box refinement, greatly improving the accuracy in this challenging scenario.

Description of the self-training module (MR1 R2 R3) Our self-training approach follows the standard methodology described in [3]. The detailed parameters will be made available alongside the code upon acceptance.
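The standard loop referenced here (train on the current pseudo labels, re-predict, keep only confident boxes, repeat) might look like the following generic sketch, where `Detector` is a hypothetical stand-in for YOLOX and the round count and threshold are illustrative assumptions rather than the paper's parameters:

```python
# Generic confidence-based self-training; Detector is a placeholder class.
def self_train(images, pseudo_labels, rounds=4, score_thresh=0.5):
    detector = None
    for _ in range(rounds):
        detector = Detector()                  # fresh detector each round
        detector.fit(images, pseudo_labels)    # supervised step on pseudo labels
        pseudo_labels = [                      # refinement: keep confident boxes
            [box for box in detector.predict(img) if box.score >= score_thresh]
            for img in images
        ]
    return detector, pseudo_labels
```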

Generality (R2) We validated our method on another dataset, CCRCC [4], to demonstrate its generality. The results are shown below.

Method|mAP|AP50|AP75|AR
YOLOX|0.634|0.906|0.732|0.705
Ours|0.591|0.879|0.683|0.679
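For context, the mAP, AP50, AP75, and AR values reported in these tables follow COCO conventions and are typically computed with pycocotools; the JSON file names in this sketch are placeholders, not files from the paper:

```python
# Standard COCO-style box evaluation with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("ground_truth_boxes.json")       # ground truth in COCO format
dt = gt.loadRes("predicted_boxes.json")    # detector outputs

evaluator = COCOeval(gt, dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP@[0.50:0.95] (mAP), AP50, AP75, AR, ...
```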

The minor issues, such as writing refinement (R2) and visualization (R4 R5), will be modified in the revised version of the manuscript, or released along with the code.

[1] R. Achanta et al., "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE TPAMI.
[2] N. Carion et al., "End-to-end object detection with transformers," ECCV 2020.
[3] I. Dópido et al., "Semisupervised self-learning for hyperspectral image classification," IEEE TGRS.
[4] Z. Gao et al., "Nuclei grading of clear cell renal cell carcinoma in histopathological image by composite high-resolution network," MICCAI 2021.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This manuscript presents a method to effectively leverage rich semantic knowledge in a large-scale visual-language pre-trained model, which is trained with natural images and text, for nuclei detection in medical images. The experimental results are promising. The rebuttal has addressed the main concerns from the reviewers regarding technical novelty, the effectiveness of the automatic prompt design and the advantages of the method compared with some other related work. All reviewers suggested (weak) acceptance after the rebuttal.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper explores zero-shot nuclei detection on histological images using a large-scale visual-language pre-trained model. The authors design an automatic prompt pipeline for nuclei detection and develop a self-training framework that uses preliminary results generated by GLIP as pseudo labels and refines the predicted boxes in an iterative manner. The framework is evaluated on a public dataset and achieves performance close to the fully-supervised approach. Almost all reviewers acknowledge the conceptual novelty of the proposed method but had some concerns about the refinement network and other details, which have been well addressed in the author rebuttal. Therefore, I suggest accepting the paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper presents a method for using visual-language pre-trained models (VLPMs) for nuclei detection, using grounded language-image pre-training (GLIP) and prompt adaptation. The reviewers see the approach as an interesting and simple method for utilizing VLPMs in medical image analysis that does not require supervised training. Weaknesses mentioned in the initial reviews include unclear innovation and limited experimental evaluation. In their rebuttal, the authors address the novelty aspect by highlighting the performance difference relative to a straightforward adaptation of VLPMs and provide additional experimental results for different settings. This brings all reviewers to rate the paper as weak accept or accept.

    Of note, providing additional experimental results in the rebuttal is not ideal as the MICCAI review-rebuttal process does not really allow for a detailed assessment in the context of the paper.

The reviewer assessments and the adoption of a timely topic (VLPMs and multimodal learning together with zero-shot detection) justify presentation and discussion at MICCAI from my perspective.


