
Authors

Fan Bai, Ke Yan, Xiaoyu Bai, Xinyu Mao, Xiaoli Yin, Jingren Zhou, Yu Shi, Le Lu, Max Q.-H. Meng

Abstract

Medical image analysis using deep learning is often challenged by limited labeled data and high annotation costs. Fine-tuning the entire network in label-limited scenarios can lead to overfitting and suboptimal performance. Recently, prompt tuning has emerged as a more promising technique that introduces a few additional tunable parameters as prompts to a task-agnostic pre-trained model, and updates only these parameters using supervision from limited labeled data while keeping the pre-trained model unchanged. However, previous work has overlooked the importance of selective labeling in downstream tasks, which aims to select the most valuable downstream samples for annotation to achieve the best performance with minimum annotation cost. To address this, we propose a framework that combines selective labeling with prompt tuning (SLPT) to boost performance with limited labels. Specifically, we introduce a feature-aware prompt updater to guide prompt tuning and a TandEm Selective LAbeling (TESLA) strategy. TESLA includes unsupervised diversity selection and supervised selection using prompt-based uncertainty. In addition, we propose a diversified visual prompt tuning strategy to provide multi-prompt-based discrepant predictions for TESLA. We evaluate our method on liver tumor segmentation and achieve state-of-the-art performance, outperforming traditional fine-tuning with only 6% of tunable parameters, and achieving 94% of full-data performance by labeling only 5% of the data.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_2

SharedIt: https://rdcu.be/dnwxK

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to solve liver lesion segmentation with limited data samples. A feature-aware prompt updater is introduced to guide prompt tuning, together with a TESLA strategy to actively select unlabeled samples. The proposed method achieves competitive results compared with the model trained on the complete training set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of this paper are listed as follows,

    1. The authors design a feature-aware prompt updater embedded in the pre-trained model to guide prompt tuning in deep layers. This method reduces the parameters that need to be tuned during fine-tuning, which is useful when only limited downstream task-related training data is available.
    2. The authors propose a prompt diversity loss (Eq.(2) in the paper) to help the model generate diversified visual prompts. This method is related to their uncertainty selection step.
    3. An uncertainty-based sample selection method is proposed to reduce the annotation cost and maintain competitive performance when compared with the same model that was trained with the full training set.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses from my side are listed below,

    1. The problem setting is not clear. Do you assume (1) a large pool of unlabeled downstream task-related data (i.e., liver lesion segmentation data in your paper) with a limited annotation budget; (2) only limited unlabeled downstream task-related data; or (3) something else? Please state clearly which situation is your problem setting. After reading the paper, I personally lean toward setting (1) - you have sufficient data samples for the downstream task (not limited data, as repeatedly stated in the paper). The evidence comes from the last row of Table 2: the model achieves higher performance when trained with fully labeled data (i.e., 752 samples). However, if your problem setting is case (2), active learning or selective labeling will not help. For example, if you only have a limited sample of size n, fine-tuning on n samples already causes overfitting; after active selection, only a subset (say 0.5*n) is selected, and this subset cannot resolve the overfitting caused by the limited data.

    2. It is not clear whether this paper is written as an active learning-related paper or not. If the proposed method (i.e., SLPT) is considered an active learning (AL)-based method, the three challenges on Page 2 (i.e., the end of the 3rd paragraph in the Introduction section) confuse me, because in that part SLPT is presented as a new method distinct from AL. If not, why is "Active Learning" used as a keyword after the abstract? After reading this manuscript, I personally think the proposed SLPT is a type of AL method, and I did not see how it addresses the three challenges on Page 2.

    3. Some necessary information about the proposed method is not clear. For example, in Section 3.2: (1) why does the sub-dataset used for the experiment in Table 1 contain 40 patients? We need more information on why and how these 40 patients were selected. (2) In Table 2, how many rounds of active selection were applied? In other words, are the results of Ours in Step-1 obtained with only 20 patients, or more? (3) For active learning, we need to know the model performance over multiple training rounds, e.g., with 20 patients, 40 patients, 60 patients, and so on.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results should be reproducible by following the content of the paper, except for several places that need to be clarified. Also, the authors state that a sample implementation will be provided upon acceptance.

    However, a private liver dataset from the authors’ hospital was used in the paper, and we cannot reproduce the results in Table 1 & 2 without this private dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Please reconsider the claimed three challenges on Page 2 (i.e., the end of the 3rd paragraph in the Introduction section). (1) In Active Learning (AL), task-agnostic models are also used; for example, VGG and ResNet models are pre-trained on ImageNet and then fine-tuned for medical-domain applications. (2) In AL, certain layers of the model can be frozen too, and some uncertainty-based methods (e.g., entropy) can still work, so I don't think this is a challenge. (3) It is not a challenge for existing AL methods that do not leverage prompt information, as they do not need prompt information.
    2. Regarding the Feature-aware Prompt Updater (FPU) in Figure 1-(a), why is the element-wise multiplication operation needed after the attention output is computed? Also, what does "parameter-efficient depth-separable convolution" refer to? Can you give any motivations, details, and references?
    3. It is not clear how to compute $P_M$ in Section 2.2. Did you use all the samples (941 CT scans in your private dataset) or all the training set (i.e., 752 samples)?
    4. Eq. (2) in Section 2.2 is not correct; when $k_1 = K$, then $k_2$ is out of its domain. Please reconsider the format of this equation.
    5. Additional ablation study of this proposed diversity loss (i.e., Eq.(2)) should be conducted to demonstrate its contribution.
    6. I know $K=3$ is used for the K prompts in Section 2.2; how did you get this $K=3$ value? Please give more details and justify your selection.
    7. Please give more details and motivations about the Tversky loss in Section 2.2. Does it apply to every pixel of predictions and ground truth?
    8. The cross-entropy term in Eq. (4) is also pixel-level binary CE loss. Is that right?
    9. Please reconsider the first sentence of Section 2.3. I don't think active learning is helpful when data is limited.
    10. In the sentence above Eq. (7), what does “to avoid manual weight adjustment” mean? Please give more explanations.
    11. Will your hospital’s liver tumor dataset be publicly released with your code?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents some novelty in its idea, and the experimental section also provides some persuasive evidence to support its claims. However, it cannot be accepted until the aforementioned concerns are addressed. I am open to reconsidering the decision upon receiving the authors’ responses to these issues.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Most of my questions and comments have been answered, so I have changed my rating to weak accept. I thank the authors for their efforts.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a method that combines prompt tuning and active learning to reduce data and computation costs when a public pre-trained model is available. The authors introduce a new prompt updater for parameter-efficient tuning of CNNs. For active learning, they select the first batch with a diversity-based method and then use an uncertainty method based on inter-prompt divergence and intra-prompt entropy. The proposed prompt updater outperforms fine-tuning with only ~6% of parameters tunable, and the proposed active learning method shows superior performance compared with other methods while significantly reducing annotation effort.
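
    For concreteness, a minimal sketch of what such a combined score (inter-prompt divergence plus intra-prompt entropy) could look like is shown below; the function name and the additive combination are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def multi_prompt_uncertainty(probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """probs: (K, C, H, W) softmax outputs from K prompt-specific predictions.

    Returns a per-pixel score mixing disagreement between prompts
    (inter-prompt divergence) and the average per-prompt entropy
    (intra-prompt entropy). The additive combination is an assumption.
    """
    mean = probs.mean(dim=0)  # consensus prediction over the K prompts
    log_probs = probs.add(eps).log()
    # inter-prompt divergence: average KL(p_k || mean) over the K prompts
    inter = (probs * (log_probs - mean.add(eps).log())).sum(dim=1).mean(dim=0)
    # intra-prompt entropy: average entropy of each prompt's prediction
    intra = -(probs * log_probs).sum(dim=1).mean(dim=0)
    return inter + intra

# A sample-level score for ranking unlabeled cases could then be:
# score = multi_prompt_uncertainty(probs).mean()
```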

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The combination of active learning and prompt tuning represents a promising direction for reducing computational and annotation costs when transferring public models to private domains.
    2. The proposed prompt updater can be applied to CNNs and achieves superior results compared to fine-tuning and the SOTA prompt tuning method while requiring minimal tunable parameters.
    3. The idea of constructing multiple prompts and using them to calculate uncertainty is interesting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Effectiveness of the meta-prompt: it is not well explained why the statistical probabilities of the foreground classes are used as the meta-prompt.
    2. Some experimental settings are not well explained. a. How were the 40 patients in Table 1 selected? Did you use random selection or the proposed method to choose them? This confuses me, since the result of the proposed method in Table 1 differs from all of the results in Table 2. b. Are the results presented in Tables 1 and 2 the average or the best result of 5-fold cross-validation?
    3. Although the proposed method is evaluated with 5-fold cross-validation as mentioned in Section 3.1, there is no variance or standard deviation presented in the results.
    4. The proposed method is significantly better than random sampling. However, the performance gap is not that large compared with the sub-optimal competitive methods. Are the results statistically significant?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim that they will release the code and pre-trained model publicly, but they use an in-house dataset for evaluation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Methodological improvements: the novelty of the unsupervised diversity selection is limited. The authors can refer to [1] and [2] for inspiration on better unsupervised selection methods (beyond diversity sampling alone). [1] Zheng, Hao, et al. "Biomedical image segmentation via representative annotation." AAAI 2019. [2] Hacohen, et al. "Active learning on a budget: Opposite strategies suit high and low budgets." ICML 2022.
    2. In this paper, the experiments on prompt tuning and active learning are somewhat disconnected and do not demonstrate the benefits of combining both techniques. I think there’s room for better experimental design to emphasize that.
    3. It is unclear whether removing prompts will harm the segmentation performance. According to Figure 2 in the supplementary material, there is little difference in most of the segmentation results even when there are significant differences between the prompts (e.g., prompts 1&3). I wonder what the performance would be if you remove the prompting part in your framework, and I think it could be an additional ablation study.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It’s a well-written paper. The idea of combining active learning and prompt tuning is interesting, and the approach is novel and effective. Despite the problems mentioned above, I recommend that this paper be accepted.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I was satisfied with their solutions and answers, so I will not change my original rating.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a framework called SLPT (Selective Labeling and Prompt Tuning) that combines selective labeling and prompt tuning to improve medical image analysis using deep learning in limited data scenarios. Previous prompt tuning research overlooked the importance of selective labeling in downstream tasks, which aims to select the most valuable downstream samples for annotation to achieve the best performance with minimum annotation cost. The paper introduces a feature-aware prompt updater, a diversified visual prompt tuning strategy, and a TandEm Selective LAbeling (TESLA) strategy to select valuable samples for labeling. The results show that SLPT outperforms traditional fine-tuning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One of the main strengths is that the paper makes a significant contribution by proposing the first framework for selective labeling and prompt tuning (SLPT), combining model-centric and data-centric methods to improve performance in medical data-limited scenarios.

    The paper’s contribution lies in the development of a novel framework that combines selective labeling with prompt tuning and the introduction of several novel techniques to improve the effectiveness and efficiency of prompt tuning in limited data scenarios. The proposed methods provide promising results for medical image analysis and have the potential to be applied in other domains with limited data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The challenge addressed in this paper is not clearly stated in the introduction. In particular, the authors mention that rare diseases have particularly little data; if so, why would expert annotation be a time-consuming task?

    Lack of discussion of implications: while the proposed method shows promising results in improving performance on limited medical lesion diagnosis data, the paper does not provide any clinical validation to demonstrate its real-world impact on clinical workflows and patient outcomes.

    Some choices are not justified: for example, the authors set K=3 during training, but the paper does not explain why this specific value was chosen and how it is then updated during testing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The software used is mentioned, as are some parameters. However, more details should be included, such as the specific versions of the software libraries and the exact hyperparameters used for training the models.

    In addition, the paper does not provide a link to the code or dataset used for the experiments. However, the authors mention that the code will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be helpful to provide more details on the dataset used, such as the imaging modality, resolution, and any preprocessing steps. This information would help readers better understand the limitations and potential biases in the data.

    In Section 3.2, you mention that your approach outperforms SPM by 1.18% and saves 0.44M tunable parameters. It would be interesting to analyze whether this improvement is consistent across different types of lesions or imaging modalities. In the conclusion, it would be useful to provide more insight into the limitations and potential implications of your work. See also: Zhang, Y., Chen, T., & Sun, J. (2020). A survey on recent progress in deep learning-based medical image segmentation. Artificial Intelligence in Medicine, 103, 101793.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper seems to make a valuable contribution, but there is still room for improvement in terms of clarity and robustness.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a prompt tuning framework for data-limited lesion segmentation. The overall pipeline is clear and seems to address the scarcity of annotations in medical imaging. It is also exciting to see modern prompt learning being introduced to the MICCAI community. However, all three reviewers have pointed out major weaknesses of this work, including the poor explanation of the settings and other related issues. Please carefully prepare the rebuttal to improve the quality of this manuscript.




Author Feedback

We thank all reviewers for their thoughtful comments. We categorize the main questions and respond to each in the following paragraphs.

AC&R1&R2&R3: Problem setting. We assume ample unlabeled data in the downstream task but costly labeling, which is common in clinical practice. We employ selective labeling to select a few of the most valuable samples to label. Meanwhile, we utilize pre-trained models (from sufficient public data) and prompt tuning (on scarce labeled data) to prevent overfitting.

R1&R2: Why and how 40 patients are selected. Since we aim to evaluate the efficacy of prompt tuning on limited labeled data in Tab.1, we create a sub-dataset of approximately 5% (40/752) from the original dataset. Specifically, we calculate the class probability distribution vector for each sample based on the pixel class in the mask and use CoreSet with these vectors to select 40 class-balanced samples. However, in Tab.2, masks are not allowed in selection strategies.
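
For illustration, a minimal sketch of this selection step, assuming a k-center greedy (CoreSet) criterion over per-sample class-probability vectors, is shown below; the function names, class count, and shapes are hypothetical, not the authors' code.

```python
import numpy as np

def class_probability_vector(mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Fraction of pixels belonging to each class in a segmentation mask."""
    counts = np.bincount(mask.reshape(-1), minlength=num_classes)
    return counts / counts.sum()

def coreset_select(vectors: np.ndarray, budget: int, seed: int = 0) -> list:
    """k-center greedy (CoreSet): repeatedly pick the sample farthest from the
    already-selected set, which encourages class-balanced coverage."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(vectors.shape[0]))]
    # distance of every sample to its nearest selected sample
    dist = np.linalg.norm(vectors - vectors[selected[0]], axis=1)
    while len(selected) < budget:
        idx = int(dist.argmax())
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(vectors - vectors[idx], axis=1))
    return selected

# Hypothetical usage for the Tab.1 sub-dataset (752 candidates, 40 selected):
# vectors = np.stack([class_probability_vector(m, num_classes=3) for m in masks])
# chosen = coreset_select(vectors, budget=40)
```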

R1&R2: Meta prompt. We refer to https://arxiv.org/abs/2208.10159 and use the statistical probability map to initialize the prompt parameter. Our $P_M$ is based on the masks of the initial labeled data (e.g., the 40 samples in Tab.1).
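
As a reading aid, one way such a statistical probability map could be computed from the initial labeled masks is sketched below, assuming masks are resampled to a common shape before averaging; this is an assumption, not the exact construction in the paper or the cited reference.

```python
import numpy as np
from skimage.transform import resize

def statistical_probability_map(masks, target_shape):
    """Per-voxel foreground frequency over the initial labeled masks;
    the result could serve as the initial value of the prompt parameter P_M."""
    acc = np.zeros(target_shape, dtype=np.float32)
    for mask in masks:
        fg = (np.asarray(mask) > 0).astype(np.float32)  # binary foreground
        acc += resize(fg, target_shape, order=0, preserve_range=True)
    return acc / len(masks)

# p_m = statistical_probability_map(initial_masks, target_shape=(24, 256, 256))
```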

R1: Relationship between selective labeling and AL. While similar, the two concepts differ in certain papers. Selective labeling can be performed in an unsupervised manner without knowing the task or initial data, as in Fig.1 of https://arxiv.org/abs/2110.03006. Conversely, AL is typically executed iteratively in a supervised manner, knowing the task and initial data; a random initial labeled set is frequently required to train the task-specific model prior to selection, as in the Problem Definition on Page 2 of https://arxiv.org/abs/2203.13450.

R1: Iterative rounds. We focus on the performance of two steps, unsupervised and supervised selection. In the future, we can iterate multiple rounds of supervised selection (step 1), like AL.

R1: 3 challenge problems.

  (1) Regardless of whether pre-trained models from ImageNet are used for initialization, they are not directly employed for medical data selection, because they are fine-tuned on random initial data before selection. In contrast, we directly select data using the pre-trained model as an initial unsupervised selection, followed by supervised selection. (2) In prompt tuning, since the model is frozen, the sample features after tuning will not change, which may cause some (feature-based) AL methods to fail. To solve this issue, we insert the FPU into the frozen layers to update the prompt and the features. Of course, as R1 notes, this does not affect some AL methods, e.g., entropy. (3) We argue that combining prompt tuning and AL is not straightforward; it is essential to consider their mutual influence. Just as the selected samples impact prompt tuning, prompt tuning also affects the value estimation of samples. We will rewrite these challenges in the final paper.

R1: FPU. The attention mechanism in our model utilizes element-wise multiplication for weighting and is often combined with residual connections, as seen in SE-ResNet or Transformer. Additionally, the depthwise separable convolution is based on Xception.
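
To make these references concrete, a minimal sketch of an attention-weighted prompt updater built from a depthwise separable convolution (Xception-style), element-wise multiplication, and a residual connection is given below; it is an assumption-based illustration in 2D, not the FPU implementation from the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Xception-style depthwise + pointwise convolution (parameter-efficient)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

class PromptUpdaterSketch(nn.Module):
    """Illustrative updater: an attention map derived from the (frozen) feature
    re-weights the prompt by element-wise multiplication, followed by a
    residual connection, in the spirit of SE-ResNet / Transformer blocks."""
    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(DepthwiseSeparableConv(channels), nn.Sigmoid())
        self.proj = DepthwiseSeparableConv(channels)

    def forward(self, feature: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        weight = self.attn(feature)            # attention map from the feature
        updated = self.proj(weight * prompt)   # element-wise weighting of the prompt
        return prompt + updated                # residual connection

# Hypothetical usage with a frozen backbone's intermediate feature:
# updater = PromptUpdaterSketch(channels=64)
# new_prompt = updater(feature_map, prompt_map)  # both of shape (B, 64, H, W)
```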

R2: Results in Tables 1&2. The efficacy of our approach has been established through 5-fold cross-validation and comparison with multiple AL methods, as shown by the average results in Tables 1&2. We omit the variance due to space constraints and may consider adding it in the final paper.

R3: Dataset details. We use enhanced CT in the venous phase with shape 41×512×512 and normalized spacing 5×0.7×0.7 mm. Training data is obtained by sampling patches of 24×256×256. Preprocessing follows nnU-Net: random cropping, resizing, and HU windowing.
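
For readers unfamiliar with this pipeline, a minimal sketch of HU windowing and random patch sampling is given below; the window bounds and helper names are illustrative assumptions (the paper follows nnU-Net preprocessing).

```python
import numpy as np

def hu_window(volume: np.ndarray, low: float = -100.0, high: float = 200.0) -> np.ndarray:
    """Clip CT intensities to an HU window and rescale to [0, 1].
    The bounds here are illustrative, not taken from the paper."""
    vol = np.clip(volume.astype(np.float32), low, high)
    return (vol - low) / (high - low)

def random_patch(volume: np.ndarray, patch=(24, 256, 256), rng=None) -> np.ndarray:
    """Randomly crop a training patch (e.g. 24x256x256) from a CT volume."""
    rng = rng or np.random.default_rng()
    starts = [int(rng.integers(0, max(s - p, 0) + 1))
              for s, p in zip(volume.shape, patch)]
    return volume[tuple(slice(st, st + p) for st, p in zip(starts, patch))]

# patch = random_patch(hu_window(ct_volume))  # ct_volume: (D, 512, 512) array
```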

We thank reviewers for providing constructive comments, such as figures, equations, loss functions, hyperparameters, better unsupervised selection methods, improved problem statements, and relevant references. We will take these suggestions into account in the final paper.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have carefully prepared the rebuttal and addressed most of the concerns from the first round of review. Although the writing and explanations still need improvement, I give my final rating of ‘accept’. This is a promising work that can benefit the MICCAI community.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers thought that the proposed combination of active learning and prompt tuning for medical image segmentation should be of interest to the MICCAI community. The rebuttal addressed the most relevant concerns, so I think this work should be accepted after minor revisions, especially to improve clarity.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers have acknowledged the strengths of the work while also pointing out certain shortcomings. However, the authors have effectively addressed concerns related to explanations of the settings, algorithmic details, and other relevant aspects. This paper exhibits an interesting and valuable contribution. I recommend that the authors carefully incorporate the promised changes in the camera-ready version. Based on these considerations, I recommend accepting the work.


