
Authors

Junde Wu, Huihui Fang, Zhaowei Wang, Dalu Yang, Yehui Yang, Fangxin Shang, Wenshuo Zhou, Yanwu Xu

Abstract

The segmentation of the optic disc (OD) and optic cup (OC) from fundus images is an important fundamental task for glaucoma diagnosis. In clinical practice, it is often necessary to collect opinions from multiple experts to obtain the final OD/OC annotation. This clinical routine helps to mitigate individual bias. But when the data is annotated by multiple raters, standard deep learning models become inapplicable. In this paper, we propose a novel neural network framework to learn OD/OC segmentation from multi-rater annotations. The segmentation results are self-calibrated through the iterative optimization of multi-rater expertness estimation and calibrated OD/OC segmentation. In this way, the proposed method realizes a mutual improvement of both tasks and finally obtains a refined segmentation result. Specifically, we propose the Diverging Model (DivM) and the Converging Model (ConM) to handle the two tasks respectively. ConM segments the raw image based on the multi-rater expertness map provided by DivM. DivM generates the multi-rater expertness map from the segmentation mask provided by ConM. The experimental results show that by recurrently running ConM and DivM, the results can be self-calibrated so as to outperform a range of state-of-the-art (SOTA) multi-rater segmentation methods.
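
To make the recurrent structure concrete, here is a minimal, illustrative sketch of the self-calibration loop described above. It is not the authors' implementation: self_calibrate, conm, divm, and all parameter names are placeholders introduced here, and the real ConM/DivM are attention-based neural networks rather than generic callables.

    # Illustrative sketch only: conm and divm stand in for the paper's ConM/DivM networks.
    from typing import Any, Callable, Optional, Sequence

    def self_calibrate(
        image: Any,
        rater_masks: Sequence[Any],
        conm: Callable[[Any, Optional[Any]], Any],     # (image, expertness map) -> calibrated segmentation
        divm: Callable[[Any, Sequence[Any]], Any],     # (segmentation, rater masks) -> expertness map
        steps: int = 4,                                # number of ConM/DivM recurrences
        init_expertness: Optional[Any] = None,         # e.g., a uniform expertness map to start
    ) -> Any:
        """Alternate calibrated segmentation (ConM) and expertness estimation (DivM)."""
        expertness = init_expertness
        segmentation = None
        for _ in range(steps):
            segmentation = conm(image, expertness)        # segment given the current expertness map
            expertness = divm(segmentation, rater_masks)  # re-estimate per-rater expertness
        return segmentation

Under this reading, each pass refines both the segmentation and the expertness map, which is the mutual-improvement behavior the abstract describes.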

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_59

SharedIt: https://rdcu.be/cVRsr

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper describes a self-calibrated optic disc (OD) and cup (OC) segmentation model from multi-rater annotations. The major contributions are:

    1. A recurrent learning framework is proposed for the self-calibrated segmentation using multi-rater annotations.
    2. In the proposed framework, two models are designed for recurrently learning the multi-rater expertness maps, and separating multi-rater annotations from the estimated segmentation masks.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The novelty of the use of multi-rater annotations. The authors propose a recurrent method to fuse and separate the annotations from multiple raters. The way that DivM recurrently returns its output to ConM makes good use of the attention mechanism. Furthermore, the authors provide a proof of the estimation of multi-rater expertness from the segmentation, which is theoretically persuasive.
    2. The experiments and the evaluation are robust. The authors’ evaluation methods are sound and the result comparison with SOTA is clear (the authors have taken both the calibrated and non-calibrated methods into account).
    3. The paper is well organized and fluently written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major concern can be raised for this paper.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments have been conducted on a publicly available dataset. Most of the implementation details are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Just some minor problems:

    • the references should be checked, e.g., the redundant entries 20 and 21;
    • typos such as “This enable us to supervise DivM…”.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The proposal is sound and the explanation is detailed, with a theoretical proof.
    2. The evaluation method is very robust and the results are satisfying.
    3. The paper is quite well organized.
  • Number of papers in your stack

    2

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors propose a new method for calibrating multi-rater annotations for optic disc and cup segmentation. Unlike previous studies, the proposed approach leverages a recurrent attention network and self-calibration to simultaneously calibrate the segmentation and learn the consensus among the annotations from multiple raters. The design is guided by half-quadratic optimization. The authors provide extensive experimental validation showing that the method outperforms the state of the art.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work is well-motivated and presents a novel approach for an important topic. The multi-rater consensus problem is ubiquitous in clinically relevant algorithm development and the proposed solution has the potential to be applied to other imaging modalities, like digital pathology.
    • A novel and interesting application of half-quadratic optimization in multi-rater calibration for optic disc and cup segmentation. The proof and proposed recurrent design seem pretty solid.
    • Solid experimental validation with two public datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • A concern is how applicable/realistic it is to apply such a model in clinical settings: how time-consuming is it, and how much computational power is needed? Can hyperparameters such as the number of recurrence steps for self-calibration be fixed during inference?
    • Grammar issues throughout the manuscript. It is highly recommended that the authors proofread and modify.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Considering the complexity of the model design and the missing details for hyperparameters, it is not easy to reproduce this work; the authors are therefore highly encouraged to release the code, model architecture, and hyperparameters for good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The three contributions listed in the introduction section are all about the methodology. I would suggest that the authors reorganize this part and add the experimental validation as one of the contributions.
    2. How many recurrent steps are used for each fused label set in Tables 2 and 3? Does this hyperparameter need to be tuned for each fused label set?
    3. For future work, it can be valuable to extend this self-calibration approach to the multi-rater challenge in other imaging modalities, such as MRI and histopathology images.
    4. Minor:
       4a. Formatting: a space is missing between the text and the square bracket for citations.
       4b. Grammar: on page 2, “… which aware the uncertainty …” → “… which is aware of the uncertainty …”.
       4c. Page 2: “potential correct label” → “potentially correct label”.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The multi-rater consensus challenge is ubiquitous in clinically relevant algorithm development. The method is novel and interesting, and the experiments are convincing, despite minor concerns about the complexity and practicality of the proposed method.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I kept my rating after reading the author feedback and the other reviews. The topic is of great interest to the community in my opinion, and the proposed method is novel, solid, and interesting. I thank the authors for addressing my concerns about the complexity and practicality of the proposed algorithm. The remaining concern is the grammar issues throughout the manuscript. The authors agreed to revise the writing, which I think is necessary before the paper is published.



Review #5

  • Please describe the contribution of the paper

    The work proposes a novel method to learn a segmentation from multi-rater annotations and shows the usefulness of the method on the task of segmenting the optic disc and optic cup in fundus images. The segmentation output is produced through an iterative optimization of multi-rater expertness estimation and expertness-based unified segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of using self-fusion labels and an SSIM loss for the combined segmentation is new.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The general idea of using a combination of meta segmentation loss and annotator specific loss is also suggested in [1].

    [1] Liao Z, Hu S, Xie Y, Xia Y. Modeling Human Preference and Stochastic Error for Medical Image Segmentation with Multiple Annotators. arXiv preprint arXiv:2111.13410. 2021 Nov 26.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that they will provide the relevant code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. There are multiple language errors in the work. Please consult with an expert. A few examples:
      • “Comparison with SOTA” instead of “Compare with SOTA”
      • Table 2: “Calibrated” instead of “Calibrate”, “Not Calibrated” instead of “No Calibrate”
    2. It is stated that the calibrated segmentation and the multi-rater expertness converge to an optimal solution. However, since the update (Equation 2) is a form of alternating minimization, it looks like it might also end up in a local minimum rather than a global minimum. It would converge to a global minimum if the objective were convex in W and V, but it is not clear whether this is the case (see the generic form sketched after this list).

    3. Evaluation metrics: it is not clear why “self fusion” is used as an evaluation metric when it is part of the method itself. Also, the given evaluation metric of “Diag” is actually another method for combining multi-rater annotations and should be compared to as well. On the other hand, the common simple multi-rater evaluation metrics “Random” and “Average” are missing.

    4. It is stated that “self-calibrated segmentation consistently achieves superior performance on various multi-rater ground-truths”. In most cases this statement holds, but not always.

    5. This sentence is not clear as there is no reporting of AUC: “The performance improvement is especially prominent for OC segmentation where the inter-observer variability is more significant, with an increase of 1.83% and 1.72% AUC over current best method on REFUGE-MV and RIGA-MV, respectively.”
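
    For reference on point 2: the paper's Equation 2 is not reproduced here, but alternating minimization of a generic objective \(\mathcal{J}(W, V)\) (a placeholder symbol, not the paper's notation) takes the form

        \[
          W^{(t+1)} = \arg\min_{W} \mathcal{J}\bigl(W, V^{(t)}\bigr),
          \qquad
          V^{(t+1)} = \arg\min_{V} \mathcal{J}\bigl(W^{(t+1)}, V\bigr).
        \]

    Each step is non-increasing in \(\mathcal{J}\), so the objective values converge whenever \(\mathcal{J}\) is bounded below; the iterates, however, generally reach only a stationary point or local minimum, and a global minimum is guaranteed only under extra assumptions such as joint convexity of \(\mathcal{J}\) in \((W, V)\).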
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of using self-fusion labels and SSIM loss for the combined segmentation is interesting. However, the paper has multiple grammar errors and is difficult to follow. There are incorrect statements and the algorithm evaluation has several issues (see details in the comments).

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The idea of a joint optimization of multi-rater expertness estimation and expertness-based unified segmentation is not new, but the specific implementation is novel and interesting. It looks like this particular method favors a consensus among a few annotators instead of a mean result.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a learning method for using multiple raters' annotations for image segmentation. The reviewers acknowledged the strength of its novelty. However, as one reviewer pointed out, the paper is poorly written and hard to follow, which I agree with after reading the manuscript. For example, Proposition 1 serves as a foundation for the proposed method, yet it is confusing and misleading. The paper needs to be revised to improve its readability before it can be published. Since the reviews are highly diverse, I would like to give the authors an opportunity to clarify any misunderstandings in the reviews.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    13




Author Feedback

We sincerely thank the reviewers for their constructive feedback on our manuscript. We are happy to learn that all reviewers appreciate the novelty of our work. The primary concern about the article is the writing. We apologize that some descriptions are relatively convoluted, and we will do our best to improve the readability in the revision. As suggested, the statement of Proposition 1 will be rewritten: the mathematical formulas will be simplified and the main conclusion will be stressed. The revision will be carefully reviewed by an expert to ensure it meets the publication standard. Besides the writing, several other concerns raised by the reviewers are addressed below.

R5 comments that [1] also uses a combination of a meta segmentation loss and annotator-specific losses. We find it to be an excellent piece of related work; we will discuss it in the revision and follow up on it in the future. However, it is worth noting that its motivation and implementation are actually very different from ours, e.g., we uniquely solve the problem in a recurrent manner. Thus, we think the two are individual works with no overlap in novelty. In addition, several concerns were raised about the evaluation metrics. 1) Why is “self-fusion” used as an evaluation metric? “Self-fusion” is used here to evaluate the consistency between DivM and ConM, not the overall performance of the model. As we mention in the article, “the results of ConM are gradually improved on self-fusion labels, indicating the increasing consistency of DivM and ConM.” We are sorry for the confusion and will present this more clearly in the revision. 2) “Diag” is actually another method for combining multi-rater annotations and should be compared to as well. Different from other multi-label fusion methods, generating “Diag” requires the diagnosis labels to be provided. As it requires extra knowledge, the comparison would not be fair. However, “Diag” is meaningful as an evaluation metric: better performance on “Diag” means the segmentation maps are more helpful for diagnosis, and such segmentation maps could be of great use in clinical practice. 3) The common simple multi-rater evaluation metrics “Random” and “Average” are missing. We did not report the evaluation on “Random” or “Average” because we find that, once the model performance reaches a certain level, these two metrics favor ambiguous results, which are not necessarily more helpful. Instead of “Average”, we used majority vote (MV) for the evaluation, which also demonstrates the model performance under a uniform expertness assumption. As the reviewer pointed out, some statements are erroneous or not rigorous enough in this version; for example, “1.83% and 1.72% AUC” should read “1.83% and 1.72% Dice”. The statements will be carefully checked in the revision.

The main concern of R3 is about the complexity and practicality of the proposed method. 1) How time-consuming is it, and how much computational power is needed? Experiments show that our 36M-parameter model (four recurrences) achieves 8.65 fps on an NVIDIA Tesla P40 GPU with an input image size of 3×256×256, which is generally applicable in real-time segmentation scenarios. Our source code will be released with the paper, so users can conveniently test the algorithm on their own hardware. 2) Can hyperparameters such as the number of recurrence steps for self-calibration be fixed during inference? Yes. The number of recurrence steps can be set arbitrarily by the user at inference time; it is a trade-off between speed and model performance. In our experiments on the two datasets, four recurrences were the most efficient. R3 also puts forward some valuable suggestions for the writing of the paper, which we will incorporate in the revision.
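
As a usage illustration of this speed/quality trade-off (building on the hypothetical self_calibrate sketch given after the abstract, not on the authors' released code), the recurrence count is simply an inference-time argument:

    # Hypothetical call sites: image, rater_masks, conm, and divm are assumed to be
    # defined as in the earlier sketch; steps trades speed for calibration quality.
    fast_seg    = self_calibrate(image, rater_masks, conm, divm, steps=1)  # fastest, least refined
    default_seg = self_calibrate(image, rater_masks, conm, divm, steps=4)  # reported sweet spot
    slow_seg    = self_calibrate(image, rater_masks, conm, divm, steps=8)  # diminishing returns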

[1] Liao Z, Hu S, Xie Y, Xia Y. Modeling Human Preference and Stochastic Error for Medical Image Segmentation with Multiple Annotators. arXiv preprint arXiv:2111.13410. 2021 Nov 26




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers revised their reviews to more favorable ratings after reading each other's comments and the authors' rebuttal. Overall, novelty is a key strength of the paper. Please take the reviewers' comments into account and improve the clarity and readability of the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    15



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper touches on an important problem that is tackled in an interesting way. The rebuttal addresses the reviewers' concerns well, proving the paper's value for inclusion in this year's MICCAI proceedings. It is, however, crucial that the authors revise the paper for clarity and check for grammar mistakes.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10


