
Authors

Muhammad Asad, Helena Williams, Indrajeet Mandal, Sarim Ather, Jan Deprest, Jan D’hooge, Tom Vercauteren

Abstract

Existing interactive segmentation methods leverage automatic segmentation and user interactions for label refinement, significantly reducing the annotation workload compared to manual annotation. However, these methods lack quick adaptability to ambiguous and noisy data, which is a challenge in CT volumes containing lung lesions from COVID-19 patients. In this work, we propose an adaptive multi-scale online likelihood network (MONet) that adaptively learns in a data-efficient online setting from both an initial automatic segmentation and user interactions providing corrections. We achieve adaptive learning by proposing an adaptive loss that extends the influence of user-provided interactions to neighboring regions with similar features. In addition, we propose a data-efficient probability-guided pruning method that discards uncertain and redundant labels in the initial segmentation to enable efficient online training and inference. Our proposed method was evaluated by an expert in a blinded comparative study on a COVID-19 lung lesion annotation task in CT. Our approach achieved a 5.86% higher Dice score with a 24.67% lower perceived NASA-TLX workload score than the state-of-the-art. Source code is available at: https://github.com/masadcv/MONet-MONAILabel
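
As a rough illustration of the probability-guided pruning idea described above, here is a minimal sketch (the confidence threshold, subsampling rule, and all names are assumptions for illustration, not the authors' exact method):

```python
import numpy as np

def prune_initial_labels(fg_probs, conf_threshold=0.8, keep_fraction=0.5, seed=0):
    """Keep only confident, subsampled voxels from an automatic segmentation.

    fg_probs: (N,) foreground probabilities from the initial automatic model.
    Returns a boolean mask over voxels retained as online training labels.
    Threshold and fraction values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    confidence = np.maximum(fg_probs, 1.0 - fg_probs)  # certainty of the fg/bg call
    certain_idx = np.flatnonzero(confidence >= conf_threshold)  # drop ambiguous voxels
    # Subsample the remaining confident (often redundant) voxels for efficiency.
    n_keep = int(len(certain_idx) * keep_fraction)
    kept = rng.choice(certain_idx, size=n_keep, replace=False)
    mask = np.zeros(fg_probs.shape, dtype=bool)
    mask[kept] = True
    return mask
```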

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_53

SharedIt: https://rdcu.be/dnwzl

Link to the code repository

https://github.com/masadcv/MONet-MONAILabel

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a multi-scale online likelihood network (MONet) for scribble-based interactive segmentation of lung lesions in CT volumes from COVID-19 patients. An adaptive online loss is formulated for the multi-scale feature extractor to learn from both the initial segmentation and the label corrections provided by the scribbles. Additionally, a probability-guided pruning method is proposed that discards uncertain and redundant labels in the initial segmentation to enable efficient online training and inference. The proposed method is evaluated on the public UESTC-COVID-19 dataset of 50 CT volumes annotated by expert annotators. The experimental results showed that the proposed method obtained overall better performance and a lower perceived NASA-TLX workload score.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • An adaptive online loss is formulated for the multi-scale feature extractor to learn from both the initial segmentation and the label corrections provided by the scribbles.
    • A probability-guided pruning method is proposed that discards uncertain and redundant labels in the initial segmentation to enable efficient online training and inference.
    • Apart from the evaluation of lesion segmentation on a public dataset, a NASA-TLX comparison is conducted to evaluate the annotator workload of the proposed method against an existing method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Multi-scale feature integration is a commonly used technique in CNN-based feature extraction.
    • The adaptive online loss is a weighted combination of two different losses, so the technical novelty is limited.
    • Besides the multi-scale part, ablation studies demonstrating how the different components contribute to the segmentation are lacking.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    If the code is made publicly available, there should be no issue with reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    • Apart from the current experiments, it would be better to compare the proposed method with the methods mentioned in the related work, such as [1-3]. There also seems to be other work using user scribbles to segment lesions on the same dataset, including [4]. Please compare with and differentiate from the latest relevant work.
    • It would be better to see how the latest end-to-end segmentation methods perform on the public COVID-19 dataset, as this problem has been well-researched.
    • The testing dataset is relatively small; it would be better to evaluate the proposed method on multiple datasets, as there are many public COVID-19 datasets.

    [1] Wang, G. et al.: Dynamically balanced online random forests for interactive scribble-based segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 352–360 (2016)
    [2] Wang, G. et al.: DeepIGeoS: a deep interactive geodesic framework for medical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(7), 1559–1572 (2018)
    [3] Wang, G. et al.: Interactive medical image segmentation using deep learning with image-specific fine tuning. IEEE Transactions on Medical Imaging 37(7), 1562–1573 (2018)
    [4] Liu, X., et al.: Weakly supervised segmentation of COVID19 infection with scribble annotation on CT images. Pattern Recognition (122), 108341 (2022)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    • There is not enough technical contribution for MICCAI; the proposed method consists of several commonly used components.
    • Comparison to the latest state-of-the-art methods is lacking. The experimental results showed a marginal improvement with a more sophisticated architecture.
    • The testing dataset is relatively small; it would be better to evaluate the proposed method on multiple datasets, as there are many public COVID-19 datasets.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    In terms of online interactive segmentation, this paper has great clinical value and a reasonable evaluation. The technical novelty is still limited, but this is understandable when considering computational efficiency in a human-in-the-loop setting.



Review #2

  • Please describe the contribution of the paper

    To overcome the lack of quick adaptability to ambiguous and noisy data in CT volumes containing lung lesions from COVID-19 patients, they propose an adaptive multi-scale online likelihood network (MONet) that adaptively learns in a data-efficient online setting from both an initial automatic segmentation and user interactions providing corrections. Their approach achieved a 5.86% higher Dice score with a 24.67% lower perceived NASA-TLX workload score than the state-of-the-art in an expert evaluation in a blinded comparative study on a COVID-19 lung lesion annotation task in CT.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. They propose an adaptive online loss that uses adaptive weights based on user-provided scribbles, enabling adaptive learning from both an initial automated segmentation and user-provided label corrections.
    2. The challenge addressed is relevant in clinical scenarios.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. More theoretical content in the methodology is recommended.
    2. The contribution section should emphasize the expert evaluation rather than the multi-scale part, which is somewhat trivial.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is good, with code to be released. The expert evaluation is described clearly enough to be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. More advanced DL feature extraction methods, and other SOTA tools from the DL domain, are recommended.
    2. The contribution section could be revised.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    To overcome the lack of quick adaptability to ambiguous and noisy data in CT volumes containing lung lesions from COVID-19 patients, they propose an adaptive multi-scale online likelihood network. The network itself is not especially advanced. The experiments are extensive and include a clinical expert evaluation.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose an adaptive multi-scale online likelihood network (MONet) that adapts to ambiguous data by incorporating human-in-the-loop guidance signals in the form of scribbles. This is achieved by an adaptive loss function that amplifies the influence of user-provided scribbles to adjacent regions with similar features. The model is trained online, i.e., with real-time human interactions instead of traditional interaction simulation schemes (robot users). MONet integrates an uncertainty-guided pruning of online training examples to exclude ambiguous training data with low model confidence. The method achieves both a higher segmentation performance (Dice) as well as a lower user workload (NASA-TLX) on a COVID-19 lung lesion annotation task using CT compared to state-of-the-art online likelihood models (ECONet, GraphCut) and exponential geodesic distance-based interactive segmentation methods (MIDeepSeg).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Demonstration of clinical feasibility: MONet is more efficient than previous online-likelihood approaches (ECONet, GraphCut) and more robust to different object scales due to the multi-scale convolution layer. The efficient online learning makes it possible to adapt to specific patients on-the-fly and fine-tune the model in real-time. This is especially relevant in clinical practice, where there is a large inter-patient variation and the fast training and deployment of patient-specific models is often desired.
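
    For readers unfamiliar with multi-scale convolution layers of this kind, a generic sketch follows (kernel sizes and channel counts are assumed for illustration; this is not the authors' exact architecture):

```python
import torch
import torch.nn as nn

class MultiScaleConv3d(nn.Module):
    """Parallel 3D convolutions at several kernel sizes, concatenated.

    Each branch has a different receptive field, capturing a different
    object scale; kernel sizes and channel counts here are illustrative only.
    """
    def __init__(self, in_ch=1, out_ch=16, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Usage: MultiScaleConv3d()(torch.randn(1, 1, 32, 32, 32)) -> (1, 48, 32, 32, 32)
```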

    A particularly strong evaluation: The authors conduct extensive experiments, comparing to related online approaches, and conduct a user study focusing on the usability of their interactive framework. The usability study is especially important since interactive segmentation models are human-centered, and it gives a better impression of how the model is perceived by annotators than the general Dice@NoC (Dice at number of clicks) metrics seen in most papers in this field, though the user study is quite small (n=1 annotator).

    Well-structured justification of design decisions: The method is presented with very detailed equations which facilitates the reproducibility of the method and justifies the decisions behind the design of the model. The application of geodesic distance makes sense, since the context of neighboring voxels should also depend on the visual features, not only on spatial. The addition of the multi-scale convolution is also well-justified and does not worsen the efficiency of the model in comparison to previous online-likelihood approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some parts of the method presentation: The authors do not explain well how the patch-wise training is converted to fully-convolutional inference during evaluation. Page 3: “The output of each scale is concatenated and fed to a fully-connected classifier, which infers the likelihood for background/foreground classification of the central voxel in the input patch.” seems to imply that the forward pass only segments/classifies the center voxel in the patch. Does this mean that the fully-convolutional inference slides through the volume voxel-by-voxel, or is there a mistake in the wording of this sentence?
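
    For context, one standard way a patch-trained classifier can be run densely (not necessarily what the authors implemented; layer sizes below are assumed for illustration) is to rewrite the fully-connected head as a 1x1x1 convolution, avoiding any voxel-by-voxel sliding loop:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a head trained patch-wise with nn.Linear can be
# re-expressed as an equivalent 1x1x1 convolution, so the same weights
# produce a dense per-voxel likelihood map in one forward pass.
fc = nn.Linear(48, 2)                        # patch-wise classifier head (dims assumed)
conv_head = nn.Conv3d(48, 2, kernel_size=1)  # equivalent dense head
conv_head.weight.data = fc.weight.data.view(2, 48, 1, 1, 1)
conv_head.bias.data = fc.bias.data

features = torch.randn(1, 48, 64, 64, 64)        # multi-scale feature volume (assumed)
likelihood = conv_head(features).softmax(dim=1)  # dense fg/bg likelihood map
```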

    Misleading “Scribbles” metric: Results from the synthetic scribbler might be misleading. The synthetic scribbler proposed in DeepIGeoS [1] and used in ECONet [2] samples more voxels if the under- and over-segmented regions are large. While this means that the model requires more scribbles if it produces a worse segmentation (i.e., larger erroneous regions), it is more a characteristic of the synthetic scribbler than of the models evaluated with it. For example, the first interactive iteration might lead to large erroneous regions, and hence to more sampled scribbles, but require far fewer scribbles in the following 9 iterations (assuming 10 iterations as in ECONet [2] were used). I would argue that the “amount of scribbles” metric in Table 2, last column, would make much more sense in a real user study than when using the synthetic scribbler.

    [1] G. Wang et al.: DeepIGeoS: a deep interactive geodesic framework for medical image segmentation. IEEE T-PAMI (2018)
    [2] M. Asad et al.: ECONet: Efficient convolutional online likelihood network for scribble-based interactive segmentation. MIDL (2022)

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility seems to be good. The authors use publicly available datasets for training/evaluation and have implemented their code with MONAI Label to facilitate reproduction of results and access to the community.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper is well-written overall, with understandable equations, justified design decisions, and a clear structure. However, there are a few points that require clarification, as mentioned in the weaknesses section.

    (1) One area that needs clarification is how the model transitions between patch-wise training and fully-convolutional inference. While the paper states that only the center voxel is classified, it is unclear how the inference is executed. Is stride 1 used over each voxel?

    (2) Additionally, the authors do not explain why the constants alpha_f, alpha_b, beta_f, and beta_b are all multiplied by the factor T = C + S. Could this constant be omitted?

    (3) While not a major weakness, the user study could benefit from more participants with varying levels of experience. However, it is understandable that obtaining annotations from experts can be time-consuming and expensive.

    (4) To help readers follow the equations more easily, the authors should include a notation table in the supplementary materials.

    (5) The authors should note that in Figure 1, the negative exponential of the geodesic distance is used, which means that large values are located near the scribbles/clicks. Therefore, this is more of an “inverse” distance metric.

    (6) Following up on the weakness regarding the evaluation with the synthetic scribbler: the authors use the number of annotated voxels with scribbles as a metric to quantify the annotation effort for each model. However, this metric characterizes the synthetic scribbler more than the evaluated models. The number of annotated scribbles S is directly linked to the size of each missegmented region, i.e., S = ceil(V_m / 1000), where V_m is a missegmented region with enough voxels (see the sketch after this list). This number might be quite large at the first interactive iteration and be drastically reduced with more scribbles. The random sampling of these voxels is also not representative of the way real human annotators annotate volumes, e.g., with continuous strokes. Finally, the authors do not elaborate on how exactly they compute the number of voxels: do you also count the scribbles from the 0th iteration, where the whole ground-truth mask is used as one large missegmented region, or only those from the subsequent iterations?
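
    To make the sampling rule above concrete, here is a minimal sketch of a DeepIGeoS-style synthetic scribbler (the connected-component analysis and the minimum-region filter are assumptions; only the S = ceil(V_m / 1000) rule is taken from the discussion above):

```python
import numpy as np
from scipy import ndimage

def synthetic_scribbles(pred, gt, min_region=30):
    """Sample scribble voxels from missegmented regions (robot-user sketch).

    For each connected missegmented region with V_m voxels, sample
    S = ceil(V_m / 1000) random voxels as scribbles; min_region is assumed.
    """
    scribbles = np.zeros_like(gt, dtype=bool)
    errors = pred != gt                          # under- and over-segmentation
    labeled, n_regions = ndimage.label(errors)   # connected erroneous regions
    for region_id in range(1, n_regions + 1):
        voxels = np.argwhere(labeled == region_id)
        if len(voxels) < min_region:             # skip tiny regions (assumed)
            continue
        s = int(np.ceil(len(voxels) / 1000.0))   # S = ceil(V_m / 1000)
        picks = voxels[np.random.choice(len(voxels), size=s, replace=False)]
        scribbles[tuple(picks.T)] = True
    return scribbles
```

    Note how the scribble count scales directly with the size of the erroneous regions, which is the point raised here: a poor first iteration mechanically inflates the reported scribble budget.
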
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper has several minor weaknesses, I believe it will be a very valuable contribution to the community. Based on the following arguments, the recommendation for acceptance is justified:

    (1) Clinical relevance: The efficient online training is quite relevant for the fast deployment of interactive segmentation models in clinical practice, and there are few deep learning online likelihood models ([1] ECONet, [2] Längkvist et al.) which implement this strategy. Thus, the paper presents a practical solution to an important data collection problem.

    (2) The paper is well-written, with well-argued design decisions and extensive experiments. Thus, the paper meets the standards of a high-quality research contribution.

    [1] M. Asad et al.: ECONet: Efficient convolutional online likelihood network for scribble-based interactive segmentation. MIDL (2022)
    [2] M. Längkvist et al.: Interactive user interface based on Convolutional Auto-encoders for annotating CT-scans. arXiv (2019)

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    While the authors answered the other reviewers’ concerns regarding novelty, dataset size, and lack of comparisons to SOTA methods, they did not address my concerns regarding the evaluation metric (scribble length) and inference strategy (central voxel classification). Hence, I am lowering my rating. Nevertheless, the paper is still valuable to the community and would be a good contribution to MICCAI 2023.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    While the reviewers recognize that the paper proposes a loss function based on adaptive weights for user-provided scribbles in the context of interactive segmentation, and that it addresses an important problem for clinical scenarios, the reviews were fairly mixed, with critiques concerning the novelty and how the proposed loss function demonstrates an advantage. The authors should address these points in the rebuttal.




Author Feedback

We thank the reviewers for their constructive feedback. Reviewers acknowledged the clarity, methodology, novelty, and detailed experimental and clinical expert validation. Below, we reply to specific technical comments:

Novelty (MONet/Adaptive Loss/End-to-end) [MR/R1/R2]: Our method introduces a novel online interactive segmentation model, which enables rapid adaptability with minimal latency for human-in-the-loop AI-assisted annotations. It is important to note that MONet should not be mistaken for an end-to-end segmentation model, as R1’s interpretation suggests. Although multi-scale feature extraction is commonly used in CNNs, its efficient utilization within an online learning setting has been unexplored. Similarly, the novelty of our adaptive loss lies in combining two loss terms, to learn from (i) automatic segmentations and (ii) user-provided corrections, using spatially varying weights defined by the exponential geodesic distance of user interactions. As mentioned on page 2 (para 1), existing SOTA online methods (i) lack multi-scale features and (ii) learn from user-provided interactions only. The clinical evaluation by an expert also highlights the novelty of our method: it results in a reduced annotation workload (NASA-TLX) with improved accuracy compared to existing SOTA.

Utilizing advanced DL methods [R2]: In addition to the comment above, note that the proposed method is targeted for online training and inference, where quick adaptability with minimal latency is required. Incorporating more advanced DL methods in this context would result in a considerable decrease in online efficiency, rendering the method impractical for online applications [2].

Comparison with existing SOTA [R1]: As our proposed method is an online interactive segmentation method, we compare with ECONet, the SOTA in online interactive segmentation. In addition, we compare with MIDeepSeg, the SOTA in interactive segmentation methodology. ECONet and MIDeepSeg have previously been shown to outperform existing methods, including DybaORF R1[1], DeepIGeoS R1[2], and BIFSeg R1[3]. As shown in our experimental validation, our proposed method outperforms both ECONet and MIDeepSeg (Tuned) in terms of accuracy, efficiency, and clinical workload as evaluated by an expert.

Experimental evaluation and test set size [R1]: Our proposed method is an online interactive segmentation model, and our evaluation focuses on showing that it can be used to perform online human-in-the-loop AI-assisted annotations with reduced workload and improved accuracy. Hence, our experiments do not show an end-to-end segmentation evaluation on large COVID datasets. We utilize two different datasets to simulate a scenario where the initial automatic segmentation model has been trained on a data source different from the one in the clinical setting. We use a large COVID-19 CT challenge dataset [21] for training the UNet and pre-training the online models, and a second dataset, UESTC-COVID-19 [27], for evaluating annotation accuracy. As we propose an AI-assisted interactive annotation method, the evaluation used only the most reliable 50 expert annotations within the UESTC-COVID-19 dataset. Both datasets are publicly available and used within the community to evaluate segmentation methods.
Ablation experiments [R1]: As shown in Tables 1 and 2, our paper includes two ablation experiments: (i) a multi-scale feature ablation, where MONet (multi-scale) is compared against MONet-NoMS (without multi-scale), and (ii) an adaptive loss ablation, where MONet-NoMS (adaptive loss) is compared against ECONet (without adaptive loss). The supplementary material provides two additional experiments: a study of the impact of the temperature term \tau in Eq. 2 (Sup. Fig. 1) and a study of the impact of \lambda and \rho on the GraphCut-based regularization (Sup. Fig. 2). We will incorporate the above responses and all constructive recommendations from R3 and R2 in the revision.
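
As an illustration of the adaptive loss described in the novelty response above, here is a minimal sketch (the exact weighting scheme, names, and the use of a scribble mask are assumptions based on the description, not the authors' Eq. 2):

```python
import torch
import torch.nn.functional as F

def adaptive_loss(logits, init_labels, scribble_labels, geo_dist, scribble_mask):
    """Blend two cross-entropy terms with spatially varying weights.

    w = exp(-geo_dist) is large near user scribbles (and in regions with
    similar features, since geodesic distance is image-aware), so corrections
    dominate there; elsewhere the initial automatic segmentation dominates.
    Shapes: logits (N, 2); integer labels and masks (N,).
    """
    w = torch.exp(-geo_dist)  # exponential geodesic weighting ("inverse" distance)
    loss_init = F.cross_entropy(logits, init_labels, reduction="none")
    loss_corr = F.cross_entropy(logits, scribble_labels, reduction="none")
    loss = (1.0 - w) * loss_init + w * scribble_mask.float() * loss_corr
    return loss.mean()
```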




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Following the authors’ rebuttal, which addresses most of the reviewers’ main comments on novelty and comparison to SOTA, I believe the paper presents interesting contributions to the field of interactive image segmentation. The paper introduces several interesting concepts, such as an online loss formulated for the multi-scale feature extractor and probability-guided pruning. R2 is more pessimistic about the paper but does not provide sufficiently detailed arguments to justify rejection and did not follow up on the authors’ rebuttal. I am therefore leaning towards acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper describes a method for AI-assisted interactive segmentation. Reviews were mixed about novelty, but the authors have elaborated on this in their rebuttal. Moreover, they have clarified some points, which has led one reviewer to change their opinion from reject to accept. Overall, I think there is sufficient support for acceptance, and the paper is interesting for MICCAI.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a novel online interactive segmentation model for human-in-the-loop AI-assisted annotations. Although the technique is largely an adaptation of existing approaches (to cater to the online interactive annotation setting), this is a good application paper with a practical, efficient, and clinically evaluated solution (albeit with only n=1 for the human-in-the-loop evaluation). Hence, the recommendation is to accept.


