
Authors

Meng Wang, Lianyu Wang, Xinxing Xu, Ke Zou, Yiming Qian, Rick Siow Mong Goh, Yong Liu, Huazhu Fu

Abstract

Deep learning models have shown promising performance in the field of diabetic retinopathy (DR) staging. However, collaboratively training a DR staging model across multiple institutions remains a challenge due to non-iid data, client reliability, and confidence evaluation of the prediction. To address these issues, we propose a novel federated uncertainty-aware aggregation paradigm (FedUAA), which considers the reliability of each client and produces a confidence estimation for the DR staging. In our FedUAA, an aggregated encoder is shared by all clients for learning a global representation of fundus images, while a novel temperature-warmed uncertainty head (TWEU) is utilized for each client for local personalized staging criteria. Our TWEU employs an evidential deep layer to produce the uncertainty score with the DR staging results for client reliability evaluation. Furthermore, we developed a novel uncertainty-aware weighting module (UAW) to dynamically adjust the weights of model aggregation based on the uncertainty score distribution of each client. In our experiments, we collect five publicly available datasets from different institutions to construct a dataset for federated DR staging that satisfies the real non-iid condition. The experimental results demonstrate that our FedUAA achieves better DR staging performance with higher reliability compared to other federated learning methods. Our proposed FedUAA paradigm effectively addresses the challenges of collaboratively training DR staging models across multiple institutions, and provides a robust and reliable solution for the deployment of DR diagnosis models in real-world clinical scenarios.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_21

SharedIt: https://rdcu.be/dnwx3

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a federated uncertainty-aware method based on the temperature reliability of each client, which is aggregated with a global encoder. They evaluate on 5 public DR staging datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The focus on addressing uncertainty in non-IID FL settings is an important one. The reweighting method is intuitive and simple to implement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The improvements in performance seem minor. Most of the baseline methods were developed to address heterogeneity in FL rather than dedicated uncertainty techniques. The experiments seem to focus on AUROC rather than uncertainty metrics such as calibration, which is confusing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experiments were conducted on public datasets and some details of the experimental settings are reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) I would like to see a comparison to related baseline methods, such as multi-domain temperature: https://arxiv.org/abs/2206.02757
    2) Why use the Youden index over more common metrics such as precision or AUROC?
    3) I would have liked to see ablation experiments with temperature.
    4) Are the performance gains statistically significant? I would recommend reporting the variance in results.
    5) How should the uncertainty score be interpreted clinically? In Figure 2, what does a score of 0.18 vs 0.4 mean to a clinician?
    6) “SingleSet” is not defined in the paper.
    7) The font size in the legend of Fig. 2 is too small to read.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical novelty is limited and the performance improvements over existing methods are somewhat minor. More experiments on other tasks against other uncertainty methods would be needed to establish the superiority of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel uncertainty-aware algorithm for the model aggregation in FL. The experimental results demonstrate the effectiveness of the proposed algorithm.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is technically sound.
    2. The writing is relatively clear.
    3. The experimental results are sufficient.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some very short equations can be integrated into the text, such as Eq. (4) and Eq. (7).
    2. Out of curiosity, what if a soft uncertainty value were used to replace the hard indicator function in Eq. (2)?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The major experimental details are given.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to the strength and weakness.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel method. Good writing and organization. Sufficient experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This manuscript proposes an uncertainty-aware method to solve the server aggregation problem of federated learning. Based on Dirichlet concentration, authors design a temperature-warmed evidence uncertainty head to obtain prediction and its corresponding confidence for each sample. After collecting the confidence scores of all samples of one client, an uncertainty-aware weighting module is used to estimate the overall uncertainty value of this client, which serves as aggregation weight on the server side. The method is evaluated in simulated FL on a multi-centric diabetic retinopathy dataset and compares favorably to several baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Clear structure (2) Thorough comparison to other methods (3) Comprehensive ablation studies.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Many formulas are given without a clear and straightforward explanation, which makes the paper difficult to follow. (2) The experiments cannot prove whether the proposed modules solve the target problem.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code and dataset setting will be released upon acceptance

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) In Method: Authors claim “the client with larger distributional heterogeneity tends to have larger uncertainty distribution and should be assigned a larger weight for model aggregation to strengthen attention on the client with data heterogeneity.”
        a) The first half of this description seems incorrect. In FL, heterogeneity is usually used for the overall FL system, rather than one client. In addition, the description “larger uncertainty distribution” is strange.
        b) If a larger weight is always assigned to an unreliable client (outlier), this client probably leads to the failure of convergence of the overall FL system.
    (2) In Method: It is uncertain whether Eq. (5) can supervise the network to learn the uncertainty of its predictions. Eq. (5) should be explained more clearly.
    (3) In Method: Why not directly compute the optimal uncertainty score via P and Y?
    (4) In Method: Authors finally use the optimal uncertainty scores as the aggregation weights, ignoring the impact of the sample sizes of clients.
    (5) In Experiment: Only showing uncertainty scores of two instances in Fig. 2 (a) and (b) cannot prove that the proposed method can evaluate the reliability of the final decision. Moreover, in Fig. 2 (c), all methods suffer from significant performance decreases and the proposed method does not show big performance advantages in contrast to other methods as the noise level increases.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Unclear explanations; the paper is difficult to follow.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    (1) All methods seem to be run only once. How did authors compute p-value? (2) Authors have solved my concerns.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes an uncertainty-aware paradigm that considers the reliability of clients during FL aggregation. The reviewers acknowledge that the paper is easy to follow and the proposed method is solid. However, reviewers have some concerns. They find the explanations of the methods a bit confusing and feel there isn’t enough evidence from the results to support the method. They also noted that comparisons with other similar methods are missing. The authors should address these issues and provide better explanations of their experiments.




Author Feedback

1) Comparison to multi-domain temperature (MDT) (R1)

Proposed | 0.9445 | 0.9044 | 0.8379 | 0.8012 | 0.8299 | 0.8636
MDT      | 0.9326 | 0.8908 | 0.7987 | 0.7919 | 0.7965 | 0.8421

2) Statistical significance (R1, R3) R: We calculated the average p-value between the proposed method and the other comparison baselines, and all average p-values are smaller than 0.05:

Methods | FedRep | FedBN  | Moon
P-Value | 0.002  | 0.0021 | 0.0062

3) Optimal uncertainty score and loss function (R1, R2, R3) The uncertainty is the distribution for evaluating the model’s prediction P and is not directly calculated from P. Therefore, uncertainty cannot be optimized directly based on the distance between P and Y. In this paper, Eqs. (5)-(6) are employed to guide the model optimization based on the belief masses and uncertainty distribution. The Youden index is a metric that is highly related to AUROC; it considers both the sensitivity and specificity of a diagnostic test and provides an optimal threshold that maximizes the trade-off between these two measures [7]. We also give more details in Sec. 2.2 and Supplementary B.
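For readers unfamiliar with the metric, the Youden index is J = sensitivity + specificity - 1, and the threshold that maximizes J is commonly taken as the optimal cut-off. The following is a minimal sketch of that search, not the authors' code: the function name `youden_optimal_threshold` and the toy scores and labels are hypothetical, and how the paper applies the index to uncertainty scores is not reproduced here.

```python
def youden_optimal_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: per-sample scores, higher means more likely positive.
    labels: 1 for positive, 0 for negative.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")

    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores)):
        # A sample is predicted positive when its score >= threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy data (hypothetical): four negatives, three positives.
scores = [0.1, 0.2, 0.3, 0.5, 0.4, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
threshold, j = youden_optimal_threshold(scores, labels)
```

On this toy data the search selects the threshold 0.4, where all positives are detected and three of four negatives are rejected.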

4) Clinical significance of the uncertainty score (R1, R3). R: Uncertainty scores effectively quantify the confidence of predictions. The value of uncertainty could be used to identify out-of-distribution cases in an open clinical environment [1-2]. For example, a prediction with low uncertainty may be deemed reliable, while a prediction with high uncertainty should be checked by an expert to prevent misdiagnosis. This underscores the significance of incorporating uncertainty in our research. [1] M. Abdar, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 2021. [2] X. Ran, et al. Detecting out-of-distribution samples via variational auto-encoder with reliable uncertainty estimation. Neural Networks, 2022.

5) Definition of SingleSet and ablation experiments with temperature (R1): R: SingleSet indicates that the network is trained locally with the client’s data, without global aggregation. In Table 2, BC+EU represents the uncertainty method without temperature, while BC+TWEU denotes an uncertainty method with temperature. The higher performance of TWEU(LUce+LTce) demonstrates the effectiveness of introducing temperature.
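As background for the EU versus TWEU distinction, the sketch below shows the standard evidential (subjective-logic) mapping from class logits to belief masses and a single uncertainty value via a Dirichlet distribution. It is an illustration under stated assumptions, not the paper's head: the softplus evidence function, the name `evidential_output`, and the five-class example are assumptions, and the temperature warming used in TWEU is not reproduced because its exact form is not given here.

```python
import math


def evidential_output(logits):
    """Map raw logits to (belief masses, uncertainty) through a Dirichlet.

    Assumption: evidence = softplus(logit); the actual head may differ.
    """
    # Numerically stable softplus, always non-negative.
    evidence = [max(z, 0.0) + math.log1p(math.exp(-abs(z))) for z in logits]
    alpha = [e + 1.0 for e in evidence]          # Dirichlet parameters
    strength = sum(alpha)                        # Dirichlet strength S
    beliefs = [e / strength for e in evidence]   # b_k = e_k / S
    uncertainty = len(logits) / strength         # u = K / S
    return beliefs, uncertainty


# Hypothetical five-class example: one confident and one uninformative output.
confident_beliefs, confident_u = evidential_output([8.0, 0.0, 0.0, 0.0, 0.0])
flat_beliefs, flat_u = evidential_output([0.0, 0.0, 0.0, 0.0, 0.0])
```

By construction the belief masses and the uncertainty sum to one, and strong evidence for a single class lowers the uncertainty relative to a flat output.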

6) Uncertainty-aware weighting module (R3). R: In general, the training data is annotated manually, and has high confidence for the model. Consequently, a client exhibiting a lower uncertainty distribution indicates that the model has been effectively trained on its local dataset. This suggests that the data distribution is relatively simple and easily learnable, rendering it an ‘easy client.’ Conversely, a larger uncertainty distribution suggests a more complex local dataset that is challenging to learn, making it a ‘hard client’ that requires more attention. We also attempted assigning lower weights to clients with a smaller uncertainty score, but observed lower performance. In addition, this study primarily focuses on how to perform model aggregation based on the uncertainty distribution of different clients, which does not conflict with the standard FedAvg paradigm based on data size.
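The weighting idea described in this response can be sketched as follows: each client reports an uncertainty score, the scores are normalized into aggregation weights so that a larger score gives a larger weight, and the server averages the shared parameters with those weights. This is a hedged illustration only. The proportional normalization, the function names `aggregation_weights` and `aggregate`, and the client names and values are assumptions; the actual UAW module may compute the weights differently.

```python
def aggregation_weights(client_scores):
    """Normalize per-client uncertainty scores into weights that sum to 1.

    Assumption: weight is proportional to the score (larger score, larger weight).
    """
    total = sum(client_scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: score / total for name, score in client_scores.items()}


def aggregate(client_params, weights):
    """Weighted average of per-client parameter vectors (lists of floats)."""
    names = list(client_params)
    size = len(client_params[names[0]])
    return [
        sum(weights[name] * client_params[name][i] for name in names)
        for i in range(size)
    ]


# Hypothetical clients, scores, and two-parameter encoders.
scores = {"A": 0.2, "B": 0.3, "C": 0.5}
params = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
weights = aggregation_weights(scores)
global_params = aggregate(params, weights)
```

Replacing the scores with client sample counts would recover a FedAvg-style weighting, which is one way to see why the two schemes are not in conflict.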

7) Robustness (R3)
a) We calculated the p-values between the proposed method and the suboptimal method with added noise levels of 0.07, 0.08, 0.09 and 0.01. The results show the robustness of the proposed method under the interference of severe noise.

Noise level | 0.07 | 0.08 | 0.09 | 0.01
P-Value     | 0.04 | 0.03 | 0.01 | 0.03

b) Furthermore, the uncertainty score can be used to filter out samples with low reliability and flag them for a doctor’s reconfirmation to avoid potential misdiagnosis. By excluding the 20% of validation-set samples with the highest uncertainty values, we observed a substantial improvement in the AUC, from 0.8636 to 0.8856.
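The filtering step in point 7b can be illustrated with a short sketch that ranks samples by uncertainty and sets aside the most uncertain fraction for expert review. The function name `split_by_uncertainty` and the ten uncertainty values are hypothetical; only the 20% fraction comes from the rebuttal, and the AUC computation itself is omitted.

```python
def split_by_uncertainty(uncertainties, refer_fraction=0.2):
    """Return (kept, referred) index lists.

    The most uncertain samples (by value) are placed in 'referred'.
    """
    if not 0.0 <= refer_fraction <= 1.0:
        raise ValueError("refer_fraction must be in [0, 1]")
    # Indices sorted from most to least uncertain.
    order = sorted(
        range(len(uncertainties)),
        key=lambda i: uncertainties[i],
        reverse=True,
    )
    n_refer = int(len(uncertainties) * refer_fraction)
    referred = sorted(order[:n_refer])
    kept = sorted(order[n_refer:])
    return kept, referred


# Hypothetical uncertainty scores for ten validation samples.
u = [0.10, 0.90, 0.30, 0.20, 0.80, 0.15, 0.05, 0.40, 0.25, 0.12]
kept, referred = split_by_uncertainty(u, refer_fraction=0.2)
```

Metrics such as AUC would then be computed on the `kept` indices only, while the `referred` samples are passed to a clinician.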




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After reviewing the paper and rebuttal, I recommend acceptance for this work. However, if p-values are reported (as recommended), the authors should clarify how they were computed. The authors should also incorporate the reviewers’ constructive suggestions in the final version.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the paper was quite borderline, R3 has raised the score from WR to WA post rebuttal. Personally, I found the rebuttal quite interesting, and it addressed most of the reviewers’ comments. Thus, I would vote for accepting the paper at MICCAI.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper has a mix of recommendations. The only reviewer who updated their rating raised it from weak reject to weak accept. I went through the complaints mentioned by the reviewers and the authors’ response, and it seems to me that most concerns have been correctly addressed. I acknowledge that this is a very borderline submission, but I believe it has enough quality to recommend acceptance, as long as the authors add the information in the rebuttal letter to the paper, or at least to the supplementary material.


