
Authors

Tomer Amit, Shmuel Shichrur, Tal Shaharabany, Lior Wolf

Abstract

A major challenge in the segmentation of medical images is the large inter- and intra-observer variability in annotations provided by multiple experts. To address this challenge, we propose a novel method for multi-expert prediction using diffusion models. Our method leverages the diffusion-based approach to incorporate the information from multiple annotations and fuse them into a unified segmentation map that reflects the consensus of multiple experts. We evaluate the performance of our method on several medical segmentation datasets that were annotated by multiple experts and compare it with state-of-the-art methods. Our results demonstrate the effectiveness and robustness of the proposed method.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_52

SharedIt: https://rdcu.be/dnwD0

Link to the code repository

https://github.com/tomeramit/Annotator-Consensus-Prediction

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a segmentation method for multi-annotator data utilizing a diffusion model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - The diffusion model was effectively utilized to form a consensus from ground-truth data produced by multiple annotators.
    - The proposed method achieved better segmentation accuracy in multi-organ segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The number of annotators and the variation in annotation quality affect the segmentation results, but the relationship between these factors and the proposed model is not clear.
    - Most of the previous methods used in the comparative study are old.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    They used public datasets. The code will be published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    - Can you compare the segmentation performance of your method with more recent methods?
    - What is “MV-UNet [19]”? The cited paper [19] is U-Net.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method effectively utilizes the diffusion model to perform segmentation from multiple annotations. The novelty of the method is good.

  • Reviewer confidence

    Not confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a diffusion-based model for multi-annotator prediction to improve image segmentation performance. The proposed approach merges multiple annotations from the same image to generate a binary mask.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents an important topic that is gaining momentum. The proposal appears technically sound and significantly improves soft Dice results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In general, the article is challenging to read due to the shallow treatment of the methodology and the lack of detail in the figures.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The database is available, but no code is provided, nor is there any mention that it will be.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors should avoid citing arXiv papers, since they are not necessarily peer-reviewed. I recommend that the authors include references only to articles accepted at conferences or in journals.

    Additionally, all acronyms should be defined in the paper.

    The authors should consider making the letters in Fig. 1 bigger to improve readability.

    I recommend shortening Section 2 and including a more detailed figure of the architecture.

    In the Methodology and Ablation sections, the authors should expand the text and include a more detailed figure.

    Captions should use “Dice” instead of “dice”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the topic is very interesting, the authors do not provide a detailed analysis of their proposal. There is no discussion of its clinical impact.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors claim that they will incorporate some cosmetic changes based on the reviewers’ comments. This will improve the quality of the paper to some extent. Therefore, I have changed my opinion from weak reject to weak accept.



Review #3

  • Please describe the contribution of the paper

    In this work, the authors mainly use a diffusion model for multi-annotator prediction. They use the diffusion model to create maps for different levels of consensus and average these maps to obtain the final soft map. The experimental results show the effectiveness of the proposed model.
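
    For concreteness, a minimal sketch of this consensus-level construction as described above (the thresholding rule, function names, and array shapes here are illustrative assumptions, not taken from the paper):

        import numpy as np

        def consensus_level_maps(annotations: np.ndarray) -> list:
            # annotations: binary masks of shape [n_annotators, H, W].
            # Level k keeps the pixels marked by at least k annotators.
            votes = annotations.sum(axis=0)
            n = annotations.shape[0]
            return [(votes >= k).astype(np.uint8) for k in range(1, n + 1)]

        def soft_map(level_maps) -> np.ndarray:
            # Average the per-level (predicted) maps into one soft consensus map.
            return np.mean([m.astype(np.float64) for m in level_maps], axis=0)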

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper is well written and easy to follow. The idea of using diffusion to encode different annotated masks is interesting and makes sense. Will the authors release the code and models? If so, it would greatly benefit the community. The experimental results outperform the existing SOTA.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are two main concerns. The first is a potentially unfair comparison in Table 1: the existing SOTA methods may output only a single prediction, whereas the proposed method generates 20+ maps and averages them for the final prediction. The authors should point this out, as it may explain the large gaps over the other methods; indeed, Figure 3 suggests that with a single run the performance may be worse or better than MRNet. Relatedly, regarding ‘Section 4.3.4 Generalization Capability’ in [12], it seems the authors directly copied the numbers from its Table 6, although that section concerns generalization. I suggest the authors confirm that they use the same setting as [12] and clarify what ‘generalization’ means here. The second concern is a missing baseline for the averaging strategy: why not use VAE+UNet as a baseline to generate multiple maps and average them?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, the authors provide the implementation details, such as the platform and the main parameters.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I suggest that the authors add fair baselines and explain why the setting and comparison are fair.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See the weaknesses above.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a diffusion-based model for multi-annotator prediction to improve image segmentation performance. The reviewers found the idea of using diffusion to encode different annotated masks interesting and sensible. However, they raised several concerns. In their rebuttal, the authors should carefully address the issues and questions raised, provide the requested methodological details, discuss the relationship between the number of annotators and the segmentation results, address the questions related to the comparisons and baselines, provide a detailed analysis, and clarify the figure captions.




Author Feedback

We thank the reviewers and the AC for their valuable feedback. All requests for elucidation will be fully addressed.

Regarding the points highlighted by the AC: Re: the number of annotators vs. segmentation results, please see R1(1). Re: comparisons and baselines, please see R1(2). Re: analysis and clinical impact, please see R2(1) and R3(2). Re: captions and readability, please see R2(2,3).

R1:

  1. To analyze the relation between annotator agreement and our model’s performance, we averaged the Dice score between every pair of annotators over each dataset. Higher mean scores indicate greater consensus.

The obtained agreement scores are 94.95/85.74/90.65/94.64/89.91 for the kidney/brain/tumor/prostate1/prostate2 datasets. Our method performs better on the datasets with higher agreement (kidney and prostate1). The other methods perform much worse on the kidney dataset, which lowers the correlation between agreement score and performance. There is no correlation between the number of annotators and our model’s performance. (A minimal sketch of this agreement computation appears after this list.)

  2. Following the review, we compared our model to DMISE [arXiv:2112.03145]. Our method differs in two major ways: an added input-image encoder and much faster inference (100 vs. 1000 inference steps). For a fair comparison, we ran DMISE in the same novel consensus setting (its performance is lower without it). On the five datasets above (always in this order), we obtain a soft Dice of 96.58/93.81/93.16/95.21/84.62 vs. 74.5/92.80/87.8/94.70/80.20 for DMISE. The numbers of annotators are 3/6/3/7/7 and, notably, the gap widens with fewer annotators.

Following the request of R3, we used VAE+UNet as an additional baseline. While we achieved plausible single-sample results, we observed no significant diversity across samples; therefore, running the model multiple times and averaging did not improve the results. The overall performance was not very impressive.

Another comparison was performed with a recent method from CVPR’23 [arXiv:2304.04745], which employs a diffusion model with a KL-divergence term to capture variability. The Dice scores obtained for its BEST runs (many runs failed completely) on the five datasets are 68.53/74.09/92.95/91.64/21.91. For a fair comparison, we conducted an additional ablation experiment in which images were paired with random annotators, without utilizing the annotator ID, obtaining 94.46/89.78/91.78/92.58/78.61. Our reduced method (ablation) outperforms theirs on 4/5 datasets, with a 100% success rate and no significant variance between runs. We are also 10x faster, since the CVPR’23 method uses 1000 steps.

  3. The method ‘MV-UNet’ was introduced in reference [12]; we apologize for the wrong reference.
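
A minimal sketch of the agreement computation referenced in R1(1) above (the function and variable names are ours, for illustration only):

    import itertools
    import numpy as np

    def dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
        # Dice overlap between two binary masks.
        inter = np.logical_and(a, b).sum()
        return 2.0 * inter / (a.sum() + b.sum() + eps)

    def mean_pairwise_agreement(dataset) -> float:
        # dataset: iterable of [n_annotators, H, W] binary arrays,
        # one entry per image. Higher values mean more consensus.
        scores = [dice(a, b)
                  for masks in dataset
                  for a, b in itertools.combinations(masks, 2)]
        return float(np.mean(scores))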

R2:

  1. See R1(1) for an analysis, which implies that the amount of data needed is proportional to the disagreement between annotators. We will discuss the possible clinical implications of obtaining not only a prediction but also a consensus. For example, the shape of a tumor can be a strong indication of its malignant nature, but regions with little consensus may significantly affect the shape of the inferred segment.

  2. We will define all acronyms.

  3. We will revise the figures, increase the font size, etc.

R3:

  1. We will release the code and models to benefit the community.

  2. The ability to perform multiple runs on the same sample, due to stochasticity, is an advantage of generative models that has been exploited in previous work [arXiv:2112.00390, arXiv:2112.03145]. It is not the same as training an ensemble of models, but it does make inference slower. Like previous contributions, we view this as a fair comparison and, as noted by the reviewer, we also report results for a single repetition. Note that we employ repetitions in exactly the same way for all diffusion-model baselines and ablations. (A minimal sketch of this repetition strategy appears after this list.)

  3. We verified that the evaluation protocol is identical to the baseline method.

  4. Regarding VAE+UNet, please see the response to R1 item 2.
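
As a minimal sketch of the repetition strategy discussed in R3(2), assuming a hypothetical `sample_mask` stand-in for one stochastic draw from the trained diffusion model:

    import numpy as np

    def averaged_prediction(sample_mask, image, n_runs: int = 25) -> np.ndarray:
        # sample_mask(image) is assumed to return one binary mask drawn
        # stochastically from the trained diffusion model (placeholder name).
        runs = [sample_mask(image).astype(np.float64) for _ in range(n_runs)]
        return np.mean(runs, axis=0)  # soft map with values in [0, 1]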




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have made a significant effort to address the reviewers’ concerns, and they did so in a convincing manner. The key idea is interesting and novel, and the paper should be accepted to MICCAI. In future work, the authors may consider calculating the overall raters’ consensus (as a form of validation) in the spirit of STAPLE (Warfield et al., 2004).



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents an innovative approach to image segmentation, utilizing diffusion models to simulate the consensus among multiple annotators. The proposed idea demonstrates novelty, and the integration of diffusion models proves to be effective. While several concerns were raised by the reviewers, the authors successfully addressed them in the rebuttal, providing satisfactory explanations regarding method details, the correlation between the number of annotators and segmentation results, as well as comparisons with the baseline. Notably, R2 raised their score from weak reject to weak accept. Consequently, based on careful consideration, I recommend accepting this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a novel idea of using a diffusion model for multi-annotator prediction. The proposed model effectively improves medical image segmentation performance. The authors addressed the reviewers’ concerns and questions in their rebuttal letter. I recommend that this paper be accepted at MICCAI.


