
Authors

Miaotian Guo, Huahui Yi, Ziyuan Qin, Haiying Wang, Aidong Men, Qicheng Lao

Abstract

The success of large-scale pre-trained vision-language models (VLMs) has opened a promising direction for transferring natural image representations to the medical domain via well-designed prompts carrying medical expert-level knowledge. However, a single prompt can hardly describe a medical lesion thoroughly or cover all of its attributes. Moreover, models pre-trained on natural images fail to detect lesions precisely. To solve this problem, fusing multiple prompts is vital to help the VLM learn a more comprehensive alignment between the textual and visual modalities. In this paper, we propose an ensemble guided fusion approach that leverages multiple statements when tackling the phrase grounding task for zero-shot lesion detection. Extensive experiments are conducted on three public medical image datasets across different modalities, and the improvement in detection accuracy demonstrates the superiority of our method.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_28

SharedIt: https://rdcu.be/dnwG6

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an ensemble guided fusion approach that combines multiple prompts to tackle the phrase grounding task for zero-shot lesion detection. With the help of a vision-language model and prompts, the method demonstrates good zero-shot ability and outperforms baseline methods by a large margin.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The method leverages prevailing vision-language models to tackle the medical lesion detection task. The outstanding zero-shot ability of vision-language models can well address the laborious data labeling process. The application of vision-language models in medical image analysis is novel and straightforward. (2) The fusion of multiple prompts is reasonable and effective, showing that introducing a comprehensive description of the object is vital. (3) The experiments are sufficient to support the effectiveness and optimal design of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) More prompt examples should be provided in the main text, e.g., in Fig. 1, for readers to better understand the method. (2) More discussion and explanation of Fig. 2 should be included; it is not obvious how to draw a conclusion from Fig. 2. For example, the authors only show the results of the “in the skin & a/symmetrical shape” and “in the skin & red/black or xxx brown” combinations. Results from every possible prompt combination should be provided. (3) Even though the localization ability is supported by the visualization results, the confidence scores under the zero-shot setting are still low. It would be better to provide the corresponding confidence scores under the 10-shot fine-tuning setting. (4) An explanation of the sub-figure in Fig. 2 is required: the Ensemble Guided Fusion under the zero-shot setting is much worse than single prompts and Syntax Based Fusion.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Given that the authors provide enough implementation details, the datasets are publicly available, and the description in the method section is clear, the reproducibility of the paper is ensured.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to the Main Weaknesses part.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is well-written. The claims and methods are supported by the relatively complete experiments. Though there still exists minor weaknesses, the paper meets the bar of acceptance. Further explanations for some minor details would help improve this paper towards a better version.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this work the authors propose an ensemble guided fusion approach of multiple prompts for vision language models in the context of lesion detection and specifically aim at phrase grounding tasks in zero-shot learning settings. The proposed approach is evaluated on several publicly available medical datasets for medical detection tasks. The results are compared to other state of the art single prompt and ensemble learning approaches for zero-shot lesion detection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The paper proposes an interesting prompt fusion strategy that yields better results than single prompts for VLMs.
    • The results outperform other state-of-the-art single prompt and ensemble approaches in the conducted zero-shot experiments.
    • An ablation study on different prompt fusion approaches is given.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My main concern with this work, although done in the important context of zero-shot learning, is its competitiveness in real-world settings. Medical applications need high accuracy to minimize the risk of wrongly influencing the decisions of medical experts relying on such systems. The demonstrated performance, although very good in the selected setting, would not be deployable in a real-world scenario.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work was evaluated on a publicly available dataset. According to the checklist, code will be released after acceptance. This information is not included in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    • I am missing a comparison or discussion regarding [1], which basically shows that LLMs are better prompt engineers than humans. How does your approach compete against [1]? In other words, is the proposed prompt fusion approach competitive with automatically engineered prompts produced by another LLM?
    • I am missing a comparison to a standard baseline model. How do you compare against, e.g., a SOTA YOLO model? Are the zero-shot benefits justifiable in terms of the reduction in accuracy? Although research on zero-shot settings is important, the benefit of this approach in a medical setting is questionable. No one would apply a method with AP below 50% in a real-world (especially medical) scenario.
    • The limitations as well as a future perspective of the proposed approach are missing.
    Minor:
    • “Extraordinary superiority” is somewhat exaggerated. Please consider rephrasing.
    • The last two sentences in the Ensemble Learning SOTA paragraph seem a little out of place and can be interpreted as being included in the proposed work by the authors. Please consider rephrasing.

    [1] Zhou et al. Large Language Models Are Human-Level Prompt Engineers. https://arxiv.org/abs/2211.01910

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this work advances the field of zero-shot learning in the context of VLMs with creative ideas, e.g., in prompt engineering, on the one hand the applicability in real-world settings is questionable, and on the other hand LLM-refined and optimized prompts might yield better results (which is not discussed in this work).

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    With an appropriate choice of text prompts, pre-trained VLMs can be readily used to perform medical imaging tasks such as lesion detection. However, a single prompt is often not sufficient to describe the object of interest, such as a lesion or a cell. In addition, a single prompt usually produces multiple candidates which may or may not carry useful information. This paper addresses that problem: multiple prompts are used to obtain multiple candidate regions, which are filtered using an ensemble clustering approach and then integrated to produce the output. The authors also propose combining this approach with language-based prompt fusion. The proposed system is compared with several zero- and few-shot baseline models and is shown to significantly outperform them in lesion detection and cell segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of condensing and filtering results produced by VLM using ensemble clustering is novel. Authors further combine it with language based prompt fusion. The proposed system produces strong results which are backed by a variety of experiments and comparisons with multiple baselines.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Many finer details about the specific choices in the clustering ensemble and the language-based fusion are lacking. Without these details the results may not be reproducible. It is not mentioned anywhere in the paper whether the code will be open-sourced.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is not mentioned anywhere in the paper whether the code will be open-sourced. At the same time, many finer details about the ensemble clustering and language-based fusion are not provided, and without them the results cannot be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The high-level description of the framework is really well written. However, finer details such as mathematical formulae, parameter settings, metric choices for clustering, etc., which are also important, are not provided in the main paper or in the supplementary document. Please include those details for reproducibility.
    • What is the dot operator in equation 3? Is it product? Concatenation?
    • Please provide details of the ensemble clustering framework. How many clusterers? What similarity or distance measure was used? How is number of cluster determined, particularly in size clustering?
    • How is mutual independence determined in the integration module?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • This is a good paper. The proposed idea of using ensemble clustering to condense and filter the output of the VLM is novel. The high level idea is well presented. Authors claim strong results and provide plenty of experimental evidence to back the claims. However, many finer details are missing which would make it hard to reproduce the results. It is not mentioned anywhere in the paper whether the code will be open-sourced.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors have addressed most of the concerns raised by the reviewers, including me. With the promised revisions included this would be a strong paper for publication.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a zero-shot lesion detection method. There are some interesting method components presented in this work and evaluation has been conducted on three public datasets. However, as noted by the reviewer, the performance is quite low and the actual practical need of using zero-shot learning in medical imaging is questionable, considering it wouldn’t be too difficult to get at least a few labeled examples, especially in the problem domains presented by the three datasets.




Author Feedback

We thank all reviewers for the constructive comments and for recognizing the novelty and effectiveness of our prompt fusion strategy. In this rebuttal, we mainly address the concerns raised about the practical need for zero-shot detection, where we provide significantly improved fine-tuning results with a few labeled examples, i.e., an increase from 36.1% to 50.2% with only 10 examples on polyp detection, and similarly for skin lesion detection, from 19.8% to 52.5% with 100 examples. In addition, we add more comparisons to baselines such as Automatic Prompt Engineering (APE) and YOLO, where we show our method is in fact preferable under zero/few-shot settings, which better suit medical scenarios with label scarcity. We thank all reviewers for the valuable advice and will include more discussion in the revision. ➤R1

  1. More prompt examples and discussion of Figs. 1 and 2: Thank you for the suggestion. We will update this in the revision to facilitate understanding.
  2. More results under the 10-shot fine-tuning setting: We conducted fine-tuning experiments in the 10-shot setting and found the performance can be greatly improved. With the same group of multiple prompts, the accuracy of the fine-tuned model is almost double that of the zero-shot model, further demonstrating the effectiveness of our method in both settings.

     Dataset | ISIC 2016 | CVC-300
     0-shot  | 19.8      | 36.1
     10-shot | 38.2      | 50.2
  3. Explanation of the sub-figure in Fig. 2: The second sub-figure shows a rare example where our approach fails to improve the zero-shot detection performance. When the provided prompts are very similar to each other, the model gives the same predictions, making it difficult for our ensemble guided fusion to screen out implausible candidates through clustering; it may instead select a wrong result by chance. This limitation can be overcome by simply increasing the diversity of the prompts. We will include more discussion in the revision. ➤R2
  4. Comparison with APE: We thank the reviewer for bringing up a valuable point. Following this constructive suggestion, we used the APE method to generate prompts for evaluation; these prompts give performance comparable to our single prompts and can still be improved by our fusion method. In addition, we find that although LLMs are better prompt engineers, they still struggle to generate precise prompts with expert-level medical knowledge. Our fusion method is in fact orthogonal to prompt engineering methods.

     Method      | ISIC 2016 | CVC-300
     APE prompt1 | 10.0      | 18.7
     APE prompt2 | 13.2      | 20.2
     APE prompt3 | 13.0      | 17.1
     Ours        | 15.5      | 37.2
  5. Comparison with YOLO: We also compare YOLOv5 with our method on CVC-300. With the same amount of labeled data (e.g., 10-shot below), our method outperforms YOLOv5. In addition, we argue that fully supervised models such as YOLO may not be suitable for medical scenarios where a large labeled dataset is often unavailable, whereas large pre-trained VLMs have strong transfer ability. The gap between training and test mAP also suggests that YOLOv5 generalizes poorly.

     Model | Train mAP | Test mAP
     Ours  | 60.8      | 50.2
     YOLO  | 16.2      | 6.4
  6. Discussion of limitations and rephrasing: One limitation of our method is that it requires multiple diverse prompts for effective clustering of the candidates. We will include more discussion in the updated version. ➤ R4
  7. Details for reproducibility: We will provide more details and release our source code on GitHub upon acceptance.
  8. Details of the ensemble clustering framework: In this module we take several metrics into account, such as Euclidean distances and candidate IoU, and use the sum of squared errors (SSE) as the clustering objective. As for the number of size clusters, we set it to 3 in order to retain candidates of moderate size.
  9. Mutual independence: We consider candidates belonging to different prompts mutually independent, since they are generated separately and independently.
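The size-clustering step described in item 8 is only sketched at a high level in the rebuttal. As a rough illustration of how such candidate filtering could look, here is a minimal Python sketch: 1-D k-means over candidate box areas with an SSE objective and three clusters, keeping only the moderate-size cluster. All names (box_area, iou, kmeans_1d, filter_by_size), the deterministic quantile initialisation, and the keep-the-middle-cluster rule are our own assumptions for illustration, not the authors' implementation.

```python
def box_area(box):
    # box = (x1, y1, x2, y2) in pixel coordinates
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    # Intersection-over-union, one of the similarity measures the
    # rebuttal mentions for comparing candidate boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def kmeans_1d(values, k=3, iters=50):
    # Plain 1-D k-means minimising the sum of squared errors (SSE),
    # with a deterministic quantile-based initialisation.
    vs = sorted(values)
    centers = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[idx].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

def filter_by_size(boxes, k=3):
    # Cluster candidate boxes into k = 3 groups by area and keep only
    # the moderate-size cluster, discarding implausibly small or large
    # detections before the integration step.
    areas = [box_area(b) for b in boxes]
    centers = kmeans_1d(areas, k=k)
    mid = sorted(range(k), key=lambda i: centers[i])[k // 2]
    return [b for b, a in zip(boxes, areas)
            if min(range(k), key=lambda i: (a - centers[i]) ** 2) == mid]
```

For example, among hypothetical candidates with areas 1, 2, 100, 110 and 10000, only the two moderate boxes survive the filter; a full implementation would combine several such clusterers (size, position, IoU) into the ensemble.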




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed most of the reviewers’ comments. The authors should revise the paper including the additional experimental results and discussion in the final version.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Though the clinical application may be preliminary, this paper does offer some novel insights to the MICCAI community and may raise some interesting discussion at the meeting, which aligns with the role of a conference paper. Thus I would be happy to see it at MICCAI.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work aims to improve the performance of vision-language models in the medical domain by addressing the limitations of using a single prompt and the lack of precision in lesion detection. The rebuttal has adequately addressed the major concerns of the three reviewers, including adding experimental results on 10-shot fine-tuning, comparing against other SOTA methods, etc. Thus, this paper is recommended for acceptance.


