
Authors

Junde Wu, Huihui Fang, Dalu Yang, Zhaowei Wang, Wenshuo Zhou, Fangxin Shang, Yehui Yang, Yanwu Xu

Abstract

With the advancement of deep learning techniques, an increasing number of methods have been proposed for optic disc and cup (OD/OC) segmentation from fundus images. Clinically, OD/OC segmentation is often annotated by multiple clinical experts to mitigate personal bias. However, it is hard to train automated deep learning models on multiple labels. A common practice to tackle the issue is majority vote, e.g., taking the average of the multiple labels. However, such a strategy ignores the different expertness of the medical experts. Motivated by the observation that OD/OC segmentation is often used clinically for glaucoma diagnosis, in this paper we propose a novel strategy to fuse the multi-rater OD/OC segmentation labels via glaucoma diagnosis performance. Specifically, we assess the expertness of each rater through an attentive glaucoma diagnosis network. For each rater, its contribution to the diagnosis is reflected as an expertness map. To ensure the expertness maps generalize across different glaucoma diagnosis models, we further propose an Expertness Generator (ExpG) to eliminate the high-frequency components in the optimization process. Based on the obtained expertness maps, the multi-rater labels can be fused into a single ground-truth, which we dub the Diagnosis First Ground-truth (DiagFirstGT). Experimental results show that, using DiagFirstGT as ground-truth, OD/OC segmentation networks predict masks with superior glaucoma diagnosis performance.
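
A minimal illustrative sketch of the fusion idea (not the authors' actual ExpG pipeline): if each rater's expertness map is read as a per-pixel weight, the multi-rater masks can be fused by pixel-wise normalized weighting. The function name, softmax normalization, and array shapes below are our assumptions for illustration only.

```python
import numpy as np

def fuse_multi_rater(masks, expertness_maps):
    """Fuse N rater masks (N, H, W) into one soft label, weighting each
    rater at every pixel by its (assumed) expertness map (N, H, W)."""
    # Normalize expertness across raters at every pixel (softmax).
    e = np.exp(expertness_maps - expertness_maps.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)      # (N, H, W), sums to 1 over raters
    return (weights * masks).sum(axis=0)            # (H, W) soft fused label

# Toy usage: three raters, 8x8 binary cup masks, equal expertness -> plain average
masks = np.random.randint(0, 2, size=(3, 8, 8)).astype(float)
expertness = np.zeros((3, 8, 8))
print(fuse_multi_rater(masks, expertness).shape)    # (8, 8)
```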

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_58

SharedIt: https://rdcu.be/cVRsq

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an idea for improving optic disc and cup (OD/OC) segmentation in eye fundus images in a scenario where multiple experts provide labels. The idea is based on rating the expertness of the different experts by analyzing their contribution to the diagnosis, using a network trained for glaucoma diagnosis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of assessing the expertness level of each expert by using a diagnosis network is quite original and, at the same time, sound.

    • The proposal is well explained and elaborated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Results show that the proposal is somewhat convoluted relative to the improvement it delivers, which in some cases (for instance, the Dice score for optic cup segmentation) is not consistent. These differences, sometimes in favour of the previous SOTA, are not properly addressed and diminish the real contribution of the proposal as it stands.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is quite well covered in terms of method explanation, data and implementation details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors propose an interesting, well-explained idea that could have many uses beyond the particular application field they have chosen; therefore it has clear merit.

    The experiments also show some good results, but they do not allow one to conclude definitively that fusing multi-rater labels by means of expertness analysis in a diagnosis network is the right approach. Some results show a decrease in performance that is not properly discussed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experimentation and results are somewhat lacking and need more work in order to propel this idea into a valid proposal.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    The paper proposes a method to segment the optic cup/disc in fundus images and also uses these masks as auxiliary input to increase glaucoma detection accuracy. To this end, a multi-rater fusion scheme and expertness maps are proposed, along with a smoothing method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • High-frequency filtering technique
    • Learning multi-rater expertness maps
    • Comprehensive results
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • In general, the network architectures are never discussed.
    • The authors need to provide pseudo-code for the training and testing phases and explain the details of the training steps.
    • In Equations 2 and 4, the sizes of ExpG and m are not consistent (how is n dropped?).
    • Why is only the Dice score considered? More metrics are needed, including false rates.
    • I cannot find/understand the details of ExpG.
    • As far as I know, the masks in REFUGE-2 were based on 7 raters, but I cannot find the 7 masks; they are also not mentioned in the cited reference. The authors may have had access to the data in another way; please make sure the claim is correct.
    • What is the architecture of the attentive diagnosis network in Figure 1?
    • Sharing the code would be helpful.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The developed code is not shared.
    • The data is publicly available.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    see section 5

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clarity of the paper; only one metric is used.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper
    1. The authors propose a novel OD/OC expertness map generation method called DiagFirstGT.
    2. The ExpG is developed to improve the performance of DiagFirstGT.
    3. The experimental results show that the method achieves results comparable with the state-of-the-art methods.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors demonstrate the relationship between glaucoma diagnosis and the OD/OC segmentation labels. Improving glaucoma diagnosis performance by fusing the multi-rater OD/OC segmentation labels is a valuable perspective.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experiments are not sufficient. (1) In Section 2.3, statistical results for the various high-frequency elimination methods should be supplemented. (2) Besides MV, the other multi-rater fusion methods in Table 1(a) should also be compared in Table 1(b).
    2. The related works should be introduced in the introduction.
    3. Some writing details of the paper are confusing and need to be improved: (1) the symbol ~ denotes equality in Section 2.1, while it means a range in Formula (4); (2) h and w denote the size in Formula (2), while H and W are used in Formula (4); (3) DiagFirstGT is defined as the result of Formula (1) in the second paragraph of Section 2.2, while it is redefined as the optimal expertness map in the third paragraph of Section 2.2.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code may be reproducible based on the descriptions in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In Section 2.1, a private dataset is used to show the diagnosis performance of segmentation masks of different qualities. The details of this private dataset should be given. Why not use the public REFUGE-2 dataset, as in Section 3?
    2. Which diagnosis network is used to show the performance of segmentation masks of different qualities in Figure 1?
    3. The result of reference [26] based on ExpG in Table 1 differs from that in Table 2.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The topic of this paper is good, but the writing needs to be improved and more experiments need to be added.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #5

  • Please describe the contribution of the paper

    The authors propose a novel strategy to fuse multi-rater OD/OC segmentation labels via glaucoma diagnosis performance, using an attentive glaucoma diagnosis network. In addition, the authors propose a model termed Expertness Generator (ExpG) to create per-rater expertness maps, from which the labels are fused into a single Diagnosis First Ground-truth (DiagFirstGT).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The ablation study performed in the paper
    • The experimental setup and the comparison with state-of-the-art methods
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty and real-world application of the proposed method

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The paper could be difficult to reproduce because some stages of the proposed method require additional details.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The need to test the proposed method with other large glaucoma datasets.
    • Why not compare the DiagFirstGT with the best models from the REFUGE Challenge?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The need to include experts to deeply analyze results. What are the main reasons for the method to classify/segment some images better than others?
    • The need to test the proposed method with other large glaucoma datasets.
    • Why not compare the DiagFirstGT with the best models from the REFUGE Challenge?
  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a method for fundus image segmentation that fuses multiple raters' annotations by considering their respective expertness derived from a diagnosis task. The reviewers give positive feedback on the novelty; however, several reviewers also raised concerns. In the rebuttal, please address the concerns on experiments and clarity.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8




Author Feedback

To begin with, we are grateful for the reviewers' valuable comments. The primary concern is raised by R1: some of the experimental results, such as Dice scores, are worse than the previous SOTA and have not been correctly explained. First, we apologize for the misunderstanding and will clarify it in the revision. However, we would like to stress that the reported Dice is not used to evaluate the model performance. The lower Dice only demonstrates that the proposed segmentation label is 'harder' to learn; it does not mean it is 'worse'. As mentioned in the article, "That is because MV…is easy to learn, while the fusion of DiagFirstGT…is more difficult to learn." The metric used here for evaluation is AUC. In Table 1, the proposed method outperforms all the others on AUC, which indicates the effectiveness of the proposed method.

To address the concern of R1, detailed explanations are provided as follows:

1. Why can Dice not demonstrate the model performance? Unlike common segmentation tasks, the purpose of this study is to find a potentially "correct" segmentation label from a group of labels collected from multiple raters. The final output of the proposed method is itself a segmentation label, so no Dice can be calculated to evaluate it. Instead, we argue that diagnosis performance should be the gold standard for evaluating segmentation labels when there are disagreements among multiple raters. Therefore, instead of Dice, we used diagnosis AUC as the metric to evaluate the generated segmentation labels.

2. How is Dice calculated in Table 1(b)? Table 1(b) shows the performance of various segmentation models when they are trained on the proposed DiagFirstGT or on Majority Vote (MV). Dice is calculated between their predictions and the ground-truth they were trained on (DiagFirstGT or MV).

3. What do these Dice scores demonstrate? The lower Dice reported here only demonstrates that the proposed DiagFirstGT is 'harder' to learn than the alternatives; it does not mean it is 'worse'.
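
For concreteness, the Dice described above is simply the overlap between a model's predicted mask and the fused label it was trained on (DiagFirstGT or MV). A minimal sketch of that computation follows; the function name and the thresholding of soft labels are our assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def dice_score(pred, target, threshold=0.5, eps=1e-6):
    """Dice between a predicted mask and a (possibly soft) fused label,
    both (H, W) arrays in [0, 1], binarized at `threshold`."""
    p = (pred >= threshold).astype(float)
    t = (target >= threshold).astype(float)
    intersection = (p * t).sum()
    return (2.0 * intersection + eps) / (p.sum() + t.sum() + eps)

# Example: compare a prediction against the fused label it was trained on
pred = np.random.rand(8, 8)          # hypothetical model output
fused_gt = np.random.rand(8, 8)      # e.g., DiagFirstGT or the majority-vote label
print(round(dice_score(pred, fused_gt), 3))
```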

R3 suggests adding the dataset and network information for the experiments conducted in Section 2.1. We apologize for the insufficiency of the related information. In fact, we also conducted the experiments on the publicly available REFUGE-2 dataset, and similar conclusions can be drawn. We will update the results and provide brief network backbone information in the revision. As the reviewer wisely suggested, a brief introduction of the related works will be supplied and several writing details will be revised in the modified version.

R4 comments that Dice alone is not enough to evaluate the model performance. However, we would like to stress that the primary metric here is AUC. The segmentation performance, such as Dice, is only used to show the difference between the proposed method and the others; a detailed explanation can be found in our response to R1. Other details, such as the network architectures and the training/testing procedure, will be supplied in the revised version.

R5 comments that: 1) the proposed method needs to be evaluated on other large glaucoma datasets. We fully agree with the suggestion. We have in fact conducted experiments on some other public datasets, and the proposed method still showed superior performance. However, due to the page limit of MICCAI, these results cannot be included in the article; we will report them in other forms, such as a GitHub page or arXiv. 2) The best models in the REFUGE-2 challenge are not compared in the experiments. In fact, Table 1(a) shows that DiagFirstGT outperforms the previous first-place solution in the REFUGE-2 challenge (89.6% AUC vs. 88.3% AUC). We did not include the specific challenge models because they often contain unique tricks; these tricks could make the comparison convoluted and distract from the main point.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal addressed the concerns about the experiments and some clarification questions well.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes an interesting approach for fundus image segmentation that fuses multiple raters' annotations. The rebuttal addressed the main concerns.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the reviewers that the paper introduces a novel idea for the common problem of merging multiple manual annotations. In the text, some aspects of the evaluation were confusing, but these were clarified in the rebuttal and could be integrated into the final version. I think the paper would be a valuable and original contribution to the conference.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


