
Authors

Han Liu, Hao Li, Xing Yao, Yubo Fan, Dewei Hu, Benoit M. Dawant, Vishwesh Nath, Zhoubing Xu, Ipek Oguz

Abstract

Medical image segmentation is a critical task in medical image analysis. In recent years, deep learning based approaches have shown exceptional performance when trained on a fully-annotated dataset. However, data annotation is often a significant bottleneck, especially for 3D medical images. Active learning (AL) is a promising solution for efficient annotation but requires an initial set of labeled samples to start active selection. When the entire data pool is unlabeled, how do we select the samples to annotate as our initial set? This is also known as cold-start AL, which permits only one chance to request annotations from experts without access to previously annotated data. Cold-start AL is highly relevant in many practical scenarios but has been under-explored, especially for 3D medical segmentation tasks requiring substantial annotation effort. In this paper, we present a benchmark named COLosSAL by evaluating six cold-start AL strategies on five 3D medical image segmentation tasks from the public Medical Segmentation Decathlon collection. We perform a thorough performance analysis and explore important open questions for cold-start AL, such as the impact of budget on different strategies. Our results show that cold-start AL is still an unsolved problem for 3D segmentation tasks, but some important trends have been observed. The code repository, data partitions, and baseline results for the complete benchmark are publicly available at https://github.com/MedICL-VU/COLosSAL.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_3

SharedIt: https://rdcu.be/dnwxL

Link to the code repository

https://github.com/han-liu/COLosSAL

Link to the dataset(s)

http://medicaldecathlon.com/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents the COLosSAL benchmark, a cold-start active learning (AL) benchmark for 3D medical image segmentation. The authors aim to answer three open questions related to uncertainty-based and diversity-based cold-start strategies for 3D segmentation tasks, the impact of a larger budget, and the effectiveness of these strategies when the local region of interest (ROI) of the target organ is known a priori. The research is conducted on five 3D medical image segmentation tasks from the publicly available Medical Segmentation Decathlon (MSD) dataset.

    The main contributions include:

    • The introduction of a cold-start AL benchmark for 3D medical image segmentation.
    • The exploration of the impact of budget and the extent of the 3D ROI on cold-start AL strategies.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novelty: The paper introduces the first cold-start active learning benchmark for 3D medical image segmentation, addressing a significant gap in the literature and providing a valuable resource for future research in this area.
    • Comprehensive Analysis: The authors evaluate six popular cold-start AL strategies on five 3D medical image segmentation tasks from the publicly available Medical Segmentation Decathlon (MSD) dataset, covering two of the most common 3D image modalities and segmentation tasks for both healthy tissue and tumor/pathology.
    • Clear Research Questions: The paper focuses on three well-defined open questions related to the effectiveness of uncertainty-based and diversity-based cold-start strategies for 3D segmentation tasks, the impact of a larger budget on these strategies, and their performance when the local region of interest (ROI) of the target organ is known a priori.
    • Reproducibility: The authors make their code repository, data partitions, and baseline results publicly available, promoting transparency and enabling other researchers to reproduce their results and build upon their work.
    • Practical Recommendations: The paper provides clear recommendations on which cold-start AL strategies are most effective for 3D segmentation tasks, such as TypiClust, a diversity-based approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Training procedure: It is not apparent how the authors reduced training stochasticity. Please comment in the rebuttal on seed selection, the number of runs and averaging, and/or the use of deterministic mode. As is well known, unfortunately, merely changing the seed can already lead to different results for a model.

    No consistent outperformance: The evaluated cold-start active learning strategies do not consistently outperform the average performance of random selection across all tasks. This is not a criticism of the paper, as it focuses on benchmarking, but it does affect the findings the benchmark reflects. As mentioned in the comments below, some statements regarding performance claims need, in my opinion, to be toned down.

    Evaluation metrics: The authors unfortunately only report mean Dice, whereas in clinical practice robustness is often preferred. Reporting standard deviation values, or the complete spread of values, would be appreciated. Along the same lines, reporting Hausdorff distances would also be nice; see the work of Reinke et al. on metric pitfalls. The recommendation to use TypiClust is also not clear to me if the variability of these results is not shown and compared to random selection. The sentence “We further note that TypiClust largely mitigates the risk of ‘unlucky’ random selection as it consistently performs better than the low-performing random samples (red dots below the dashed line).” reads a bit overstated to me, as I only see a benefit on liver and tumor, and a lesser one on spleen (perhaps showing the variance of the TypiClust results would make this statement stronger).

    Lack of comparison with iterative AL: The paper focuses on cold-start AL strategies without comparing their performance to iterative AL methods. The practical question is what the benefit of using TypiClust over random selection is when combined with a traditional AL method: does it pay off? Refs [4][25] are probably not representative enough to answer this. Such a comparison could help researchers and practitioners better understand the trade-offs between cold-start and iterative AL approaches in various medical imaging scenarios.

    The diversity-based methods (ALPS, CALR, and TypiClust) are tested using a 3D auto-encoder for feature extraction. Although this approach addresses the challenge of benchmarking diversity-based methods for 3D tasks, it may not be the most effective way to represent the feature space, and other feature extraction methods could yield different results.
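
    As a concrete illustration of the kind of pipeline being discussed, below is a minimal sketch of diversity-based cold-start selection over autoencoder latents (cluster the latent features, then pick the sample nearest each cluster centroid). It is only a generic sketch under these assumptions; encode_volume is a hypothetical stand-in and none of it is taken from the paper's released code.

        # Hypothetical sketch: diversity-based cold-start selection from 3D
        # autoencoder latents. `encode_volume` is a placeholder for whatever
        # encoder is actually used; it is not part of the paper's code.
        import numpy as np
        from sklearn.cluster import KMeans

        def select_diverse(latents: np.ndarray, budget: int, seed: int = 0) -> list:
            """Cluster latent features into `budget` groups and return the index
            of the sample closest to each cluster centroid."""
            kmeans = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(latents)
            selected = []
            for c in range(budget):
                members = np.where(kmeans.labels_ == c)[0]
                dists = np.linalg.norm(latents[members] - kmeans.cluster_centers_[c], axis=1)
                selected.append(int(members[np.argmin(dists)]))
            return selected

        # latents = np.stack([encode_volume(v) for v in unlabeled_volumes])  # hypothetical
        # initial_set = select_diverse(latents, budget=5)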

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Excellent levels of reproducibility

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I appreciate the clearness of the paper and its writing. The paper reads very well and focuses on an important point. Below I list some points that hopefully help the authors to improve the study and beyond.

    Why was BRATS left out? It is arguably the largest/most popular challenge to date, and it also includes multi-sequence MRI. Ground-truth annotation in BRATS is also very time consuming, hence the attractiveness of a cold-start investigation for it.

    From the weaknesses, please consider improving the set of metrics being reported. Particularly, please recall that AI models in real-world situations need to be robust.

    M=5 is not clear: why not 3 or a different number? Is there any justification for this specific value? This is a minor comment, as I am not challenging the actual value but the lack of justification.

    The diversity metrics are heavily affected by the quality of the encoding. I suggest mentioning whether you think this is relevant. Basically, my concern is that calculating meaningful distances in high dimensions is not trivial, and the non-experienced reader might not know that.

    The results on organs (liver & pancreas) and tumors are mixed. One of the conclusions is that methods struggle here, so the question is to what extent the tumor part affects the results.

    More of a minor side comment: to me, the typicality proposed in TypiClust [6] is the same as (or very close to) the Fréchet mean (i.e., the sample minimizing the distances to all its neighbors). If this is the case, I suggest stating it.
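
    For reference, the typicality score in TypiClust is, to my understanding, the inverse of the mean Euclidean distance to the K nearest neighbors; a minimal numpy sketch (feature matrix and K are placeholders) makes the comparison with the Fréchet mean concrete:

        # Sketch of the TypiClust-style typicality score: the inverse of the mean
        # distance to the K nearest neighbors, so samples in dense regions score high.
        import numpy as np

        def typicality(features: np.ndarray, k: int = 20) -> np.ndarray:
            dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
            np.fill_diagonal(dists, np.inf)        # exclude self-distance
            knn = np.sort(dists, axis=1)[:, :k]    # K nearest neighbors per sample
            return 1.0 / knn.mean(axis=1)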

    The results of the uncertainty methods seem to vary quite a lot; this seems to be a reflection of a suboptimal proxy task (thresholding, or via Otsu). In AL, uncertainty outperforms diversity; hence, I would expect a better proxy to compensate for the lack of robustness of the uncertainty-based approaches. This is not a criticism of this paper but a remark that the authors can consider for their discussion and future developments.
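
    To illustrate what such a proxy target might look like, here is a minimal sketch of an Otsu-based foreground pseudo-label for a 3D volume, assuming the proxy task is a simple foreground/background pseudo-segmentation; this is an illustration of the idea, not necessarily the exact recipe used in the paper.

        # Minimal sketch (assumed recipe): derive a binary pseudo-label for proxy
        # training by global Otsu thresholding of a 3D intensity volume.
        import numpy as np
        from skimage.filters import threshold_otsu

        def otsu_pseudo_label(volume: np.ndarray) -> np.ndarray:
            t = threshold_otsu(volume)                # global intensity threshold
            return (volume > t).astype(np.uint8)      # foreground/background pseudo-label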

    Fig. 3 is really nice for seeing the impact of budget. I am not clear about the number said to be the Dice difference between each strategy and the mean random selection: is it an accumulation?

    I would revisit this statement and change the word “consistently” to “mostly” (and even “mostly” might be overshooting): “using only the local uncertainty or diversity for cold-start AL cannot consistently outperform the global counterparts”; I see at least 6-7 blue blocks in Fig. 3b (out of 16).

    I would change the word “best” to “robust” in “TypiClust [6] stands out as the best option for cold-start”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty and importance of the subject are really nice. The weaknesses can be solved but the current version already makes the community aware of the topic and provides tools to work on it. The main findings are tool-dependent and hence the recommendations need to be tuned-down or condition to the different elements used, such as auto-encoder, type of proxy task, etc.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper presents a benchmark for cold-start active learning on 3D medical datasets, incorporating 2 CT datasets and 3 MRI datasets. The benchmark entails an analysis of 2 uncertainty-based and 3 diversity-based active learning methods. The paper presents three principal findings. Firstly, a performance comparison between recently proposed cold-start AL methods and the random sampling baseline. Secondly, an examination of the influence of the initial budget on cold-start AL methods. Thirdly, an investigation into whether ROI, if given, would improve the efficacy of cold-start AL methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s motivation is clearly stated, with a focus on exploring the underexplored area of cold-start active learning. This is particularly pertinent in the context of 3D medical data, where previous active learning methods, mainly validated on 2D natural images, are not easily applicable. Recently, zero-shot recognition using foundation models has become a trend. However, it is unclear whether such models are applicable to the 3D medical data domain, since they are mainly trained on 2D natural images. Therefore, active learning for 3D medical data is still an important research problem to solve. The paper presents an analysis of five recent AL methods on five distinct medical datasets, which is a reasonable scope for the study. This analysis is expected to serve as a valuable baseline for further research in the field of active learning for 3D medical data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Figure 2 only reports the average performance without the standard deviation. While it is presumed that one clear advantage of AL methods over the random sampling baseline is consistent performance with smaller variance, the paper does not demonstrate this.
    • Although the paper acknowledges the limitations, they are not analyzed. Specifically, in Fig. 3, AL methods underperform random sampling on two datasets, LVR and PAN. However, the reason for this is not addressed.
    • As active learning aims to achieve label-efficient annotation processes, most AL research deploys an iterative process of labeling and training a task model. However, the paper only focuses on the initial stage of active learning. It is unclear how the findings of this cold-start AL study will affect the following iterative stages, as this aspect is not explored in the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper states that the code will be publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    To enhance the paper’s rigor, it is suggested that the authors include the standard deviation in Fig. 2. Additionally, it is recommended that the paper address the reasons behind the underperformance of AL methods on the LVR and PAN datasets and discuss the circumstances under which AL methods would fail. One limitation of the paper is the absence of an analysis of how the cold-start AL approach will impact the iterative AL process. To address this limitation, the authors may consider including this aspect in either the limitations or the conclusion section of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a benchmark for recent cold-start active learning (AL) methods, utilizing five distinct 3D medical datasets. Given the underexplored nature of the topic, particularly in the medical domain, the study is both timely and appropriate. The analysis offered in the paper serves as a useful foundation for future research. There are some limitations not addressed in the paper: it does not thoroughly investigate the limitations of the AL methods employed, and the analysis presented is restricted to the cold-start setting alone, with no consideration of the iterative AL setting.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The paper investigates an underexplored topic, cold-start active learning for medical applications. The benchmark and analysis in the paper will be a useful foundation for future research. Although the paper does not propose technically advanced methods, the investigation of the problem and establishing a benchmark are meaningful contributions. The paper would be more impactful if it covered iterative AL settings.



Review #3

  • Please describe the contribution of the paper

    The authors offer the first cold-start active learning benchmark for 3D medical image segmentation, and the code is publicly available.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors offer the first cold-start active learning benchmark for 3D medical image segmentation, and the code is publicly available. The authors have conducted comprehensive experiments and present three major findings. The paper is well-written and the techniques used in the paper are reasonable. The methods and datasets used in this paper are all publicly available, which can be an asset to the research community.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The issue with this paper is that the clinical value is very limited. The authors only collected some state-of-the-art methods and conducted an experimental study with publicly available datasets. The original technical contributions are very limited as well. The problem presented in this paper is not very valuable. As far as the reviewer is concerned, there are publicly available medical datasets with annotations, and transfer learning and self-supervised learning have been shown to be very efficient at warm-starting the learning process. It is not wise to do a cold start.
    The authors’ findings are not promising. For example, it is very easy to learn that the more the budget, the better the performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methods and dataset used in this paper are all publicly available. The code is also publicly available. So the reviewer believes that the reproducibility of the paper is great.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The reviewer’s comments are as follows.

    1. The application of the proposed benchmark is too narrow, i.e., 3D medical segmentation, and the 3D medical segmentation task is well-defined and solved. The authors may want to incorporate more applications, e.g., classification and detection.
    2. The contributions need to be highlighted. The reviewer can only learn that the authors have produced some experimental results using existing methods.
    3. More comparisons with warm-start methods may be needed. Pre-trained models obtained via transfer learning or self-supervised learning are available. The authors should persuade the audience why the cold start is needed.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In the reviewer’s opinion, the problem presented in this paper is not of great clinical importance and there is no technical novelty, so the contributions are very limited.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a benchmark and performance characterization of recent cold start active learning methods on 3D medical image segmentation. Reviewers appreciated the systematic approach with multiple leading methods for the underexplored 3D segmentation task. However, there are important concerns that need to be addressed in a rebuttal phase.

    Suggest that the authors craft a response to reviewers’ comments with a focus on:

    1) explaining how their findings may affect follow-on iterative AL stages that would typically be employed in practice;
    2) reviewing other warm-start methods and better motivating their focus on the narrow cold-start problem;
    3) providing the requested methodological details, incl. justification of experimental choices and chosen performance metrics;
    4) qualifying the main findings based on the proxy task type, choice of autoencoder, and methodological limitations, and addressing reasons for underperformance;
    5) highlighting their contributions clearly and connecting these to directions for future research.




Author Feedback

Impact on iterative AL [R1,R2,AC]: Previous research [4,13,25] suggests better initialization improves subsequent warm-start AL iterations. However, we have not empirically demonstrated this. We note that iterative AL would require significantly more computation and is beyond the scope of the current work. Besides, cold-start AL methods are not always followed by iterative AL (see next response). We will include this in Discussion.

Cold-start AL is narrow [R3,AC]: Cold-start AL has many practical uses, as stated in the 2nd paragraph on page 2: besides improving follow-up warm-start AL, cold-start AL also aims to study the general question of constructing a training set for an organ that has not been labeled in public datasets. This is a very common scenario (whenever a dataset is collected for a new application), especially when iterative AL is not an option.

Only mean Dice reported [R1,R2,AC]: We will update the tables in the supplementary material with the standard deviation of Dice. The Hausdorff distance is not suitable for tumors (as there may be multiple tumors per image), but we will include it for the other 3 datasets.

Qualify main findings [R1,R2,AC]: The choice of proxy task type [13] and autoencoder [25] is based on mainstream techniques in cold-start scenarios. We agree that other approaches for feature extraction and uncertainty estimation may have different impacts on performance and we will include this in Discussion.

Reasons for underperformance [R2,AC]: The main contribution of our paper is to provide a benchmark to evaluate AL methods. For example, if one only evaluates ProxyRank-Ent on the spleen dataset, one might prematurely conclude that it outperforms random sampling, whereas our benchmark reveals its weakness in other scenarios such as LVR and PAN. The manuscript includes our current hypotheses about the underperformance (“the uncertainty-based methods heavily rely on the uncertainty estimated by the network trained on the proxy tasks, which likely makes the uncertainty of tumors difficult to capture”). Further analysis merits its own dedicated study in future.

Contribution [R3, AC]: We offer a benchmark for an important yet underexplored research topic.

Training procedure [R1]: For all experiments we use the deterministic training mode in MONAI with a fixed random seed=0.
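
A minimal sketch of this determinism setup, assuming the standard MONAI utility is used (the full training configuration is in the released repository):

    # Sketch of the deterministic setup described above (assumed usage).
    from monai.utils import set_determinism

    set_determinism(seed=0)  # fixes Python/NumPy/PyTorch seeds and enables
                             # deterministic cuDNN behavior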

Overstated sentence for TypiClust [R1]: We clarify that we focus on risk mitigation instead of overall superiority. The risks are the ‘unlucky’ random selections represented by the red dots far below the dashed line. TypiClust is robust across most datasets and better than many unlucky red dots.

Limited clinical value [R3]: We respectfully disagree. Given a new segmentation task, the annotations of target organs may be unavailable or defined differently in public datasets. Such scenarios require a cold-start approach. R1 and R2 also agree with the significance of our study: “The novelty and importance of the subject are really nice.” [R1] and “Given the underexplored nature of the topic, particularly in the medical domain, the study is both timely and appropriate.” [R2]

Transfer/self-supervised learning can warm-start the learning process [R3]: We clarify that warm-start in AL is different from warm-start in model initialization. Even if the model is initialized via transfer/self-supervised learning, to fine-tune a model to the downstream task, one still needs to label images in the target dataset. Choosing which samples to label from the target dataset is still a cold-start problem.

It’s easy to learn that more budget improves performance [R3]: We clarify that we do not just show that each method improves with a higher budget, which is indeed trivial. Rather, we compare AL vs. Random under different budgets. Without this experiment, it is unknown whether the AL methods become more effective or less effective when more budget is available. In fact, we find that the effectiveness of some, but not all, AL methods over Random becomes more pronounced with a higher budget.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a benchmark and performance characterization of recent cold start active learning methods on 3D medical image segmentation. The topic of the paper is novel, interesting and practically relevant, although underexplored. The rebuttal has addressed many of the questions raised by reviewers, and the clarifications on methods and results are appreciated.

    Recommendation is to accept. Will be important for the authors to amplify the introduction, related work and discussion in line with all feedback and rebuttal responses in camera ready.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the three reviews, all agree that the paper is of importance because it explores a new area that has not been researched before in medical image processing – cold-start active learning – and establishes the first benchmark results for 3D medical image segmentation. While the novelty may be somewhat limited, I believe that as a conference paper it is of value to publish these first results. So I disagree with the harsh reject of Reviewer 3.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work benchmarks five methods for cold-start active learning on five 3D medical image segmentation tasks for which data was publicly available and reports the results, guided by three main research questions.

    Reviewers questioned several important aspects of the way the benchmark has been conducted, such as focusing entirely on Dice, excluding popular datasets such as BRATS without clear reasoning, and presenting a biased impression of active learning by showing the variability in the results of random selection but hiding the variability in the results of active learning. The authors promise to report the standard deviation and add further metrics in the final version, but I expect that this is likely to affect their conclusions, and I believe that it should be followed by another round of reviews.

    Another point, raised by R3, is how much value we can expect from this work. Even though I agree with the authors that the cold-start problem as such is relevant, I agree with R3 that the current results do not make a compelling case for any of the studied methods. As the authors state themselves, “We find that almost no AL strategies are very effective for the segmentation tasks that include tumors”. Given that the paper does not propose any ideas on how to fundamentally advance this field, it is not clear to me whether the problem that this paper is trying to promote – without presenting a new technical contribution itself – is indeed an exciting and under-explored research direction, or rather a dead end, especially if we consider that the practically relevant baseline should not be random selection (which is already hard to beat) but a human expert selecting cases for annotation: in 3D per-voxel labeling, the effort for manual selection will be much lower than for manual annotation, so it seems perfectly reasonable to involve an expert already at the selection stage.


