
Authors

Ruipeng Zhang, Ziqing Fan, Qinwei Xu, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Abstract

Federated learning has been extensively explored in privacy-preserving medical image analysis. However, the domain shift widely existed in real-world scenarios still greatly limits its practice, which requires to consider both generalization and personalization, namely generalized and personalized federated learning (GPFL). Previous studies almost focus on the partial objective of GPFL: personalized federated learning mainly cares about its local performance, which cannot guarantee a generalized global model for unseen clients; federated domain generalization only considers the out-of-domain performance, ignoring the performance of the training clients. To achieve both objectives effectively, we propose a novel GRAdient CorrEction (GRACE) method. GRACE incorporates a feature alignment regularization under a meta-learning framework on the client side to correct the personalized gradients from overfitting. Simultaneously, GRACE employs a consistency-enhanced re-weighting aggregation to calibrate the uploaded gradients on the server side for better generalization. Extensive experiments on two medical image benchmarks demonstrate the superiority of our method under various GPFL settings. Code available at https://github.com/MediaBrain-SJTU/GPFL-GRACE.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_2

SharedIt: https://rdcu.be/dnwAz

Link to the code repository

https://github.com/MediaBrain-SJTU/GPFL-GRACE

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of the paper is a “Gradient Correction” framework called GRACE that attempts to balance the twin objectives of personalization (good performance on local test data) and generalization (good performance on out-of-domain unseen test data) in federated learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The attempt to simultaneously achieve both personalization and generalization in federated learning (FL) is relatively novel. Earlier approaches were mainly focused on one of the two.

    2) The number of algorithms that have been benchmarked is impressive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.


    1) The primary weakness of the proposed approach is the heuristic aggregation phase (Section 2.3).

    a) Consistency between gradients cannot be the sole indicator of gradient quality. Suppose that we consider the extreme case where all the gradients are identical. In this case, no learning is likely to happen through aggregation because there is no additional information available.

    b) Furthermore, the theoretical analysis of the re-weighting scheme (Eq. (4)) does not appear to make any sense. The term “consistency degree” has not been defined, and it is not clear why a higher consistency degree would lead to better out-of-domain performance.

    2) It is not clear how the proposed method has been evaluated. There will be two models at the end: a local model for each client and a global model. Is the personalization result (Table 1) based on the local models and the generalization result (Table 2) based on the global model?

    3) The tables in the paper are poorly explained. There is no mention of what these numbers represent (what is the exact evaluation metric?). The improvements appear to be marginal and may not be statistically significant in most cases (especially given the high standard deviations reported in the supplementary material).

    4) More ablation studies are required to understand the relative contributions of the client- and server-side corrections. What happens when no server-side correction is employed? Again, Table 4 is very hard to interpret.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Appears to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please address the weaknesses discussed earlier.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the topic is interesting and the proposed setting is relatively novel, the technical correctness of the proposed approach is not convincing. The empirical results also do not demonstrate any significant improvement.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a new method to tackle both personalized and domain-generalized FL problems in medical imaging. Their method consists of client- and server-side elements. On the client, meta-learning is conducted to align the features between the local and the global models. On the server, aggregation weights are computed for each client based on cosine similarities of model parameters. In the experiments section, the authors evaluate their approach on a skin lesion classification and a prostate segmentation task. In line with their motivation and problem statement, the method is evaluated regarding its personalization capabilities, where it is on par with SOTA methods, and its generalization capabilities, where it outperforms SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors introduce two - to the best of my knowledge - novel FL method components: one for model aggregation and one for the local learning algorithm, which demonstrate consistent improvements in the experiments.

    • The experiments use two common and public datasets and compare the proposed method to many baselines. Ablation studies provide additional insights on the importance of each algorithm component and a qualitative demonstration of the similarity of feature representations from the global/local model(s).

    • The paper is generally clearly structured and well-written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Many details on the experiments are not reported: the network architecture is not described, the evaluation metrics are not explicitly given, and there is no description of hyperparameters or model selection strategies. Furthermore, the exact procedures for personalization and the test-time adaptation strategy are not included.

    • The method requires computing O(K^2) scalar products between model parameter vectors at each aggregation (K being the number of federated sites), which does not scale well to very large federations or models. In currently realistic medical settings with O(10) sites, this may still be feasible, though (see the sketch after this list).

    • The proposed method allows using different choices for the client-global alignment loss, which is in general a nice property. However, the authors do not report how they made this choice in their experiments. From the tables, one can conclude that the CORAL method was used for one dataset and the MMD method for the prostate task. This kind of model selection should be justified for a fair comparison.

    • While the authors provide a high-level motivation of the approach, an intuitive explanation of individual design choices, e.g., why meta-learning optimization is necessary compared to a simple regularization term, would be helpful.
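
    To make the scaling concern above concrete, below is a minimal sketch (not the authors' code) of the all-pairs similarity computation that the aggregation step implies; `updates` is assumed to be a list of flattened per-client parameter deltas.

    ```python
    import numpy as np

    def pairwise_cosine(updates):
        """All-pairs cosine similarity between K flattened client updates.

        Cost is O(K^2 * d) for d model parameters -- the scaling concern
        above: fine for ~10 sites, heavy for large federations or models.
        """
        K = len(updates)
        normed = [u / (np.linalg.norm(u) + 1e-12) for u in updates]
        sims = np.eye(K)
        for i in range(K):
            for j in range(i + 1, K):  # K*(K-1)/2 dot products
                sims[i, j] = sims[j, i] = normed[i] @ normed[j]
        return sims
    ```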

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The datasets are publicly available

    • The description of hyperparameters and implementation details is meager, so it is important that code to reproduce the experiments will be published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Abstract:

    • minor language issues: “widely existed”, “almost focus”

    Intro:

    • Why is only IOP-FL mentioned here and not also the other GPFL methods like FedRoD, PerFedAvg and pFedMe?

    Method:

    • Very detailed introduction and problem statement of why FedAvg is not enough, which is well written and good for understanding, but not 100% necessary
    • Eq. 2: shouldn't it be θ′ in the RHS L_m term?
    • An intuitive explanation of why the meta-learning step is preferable to simply adding the alignment loss as a regularization term would be helpful
    • Aggregation: weighting similar gradients higher increases consistency but reduces “diversity”. Did you explore this tradeoff further? For example, it might happen that sites that are more “unusual” are disadvantaged during training in favor of optimizing others.
    • Eq. 4/theoretical analysis: the assumption c1 = … = cM seems very restrictive and unrealistic. Does the analysis only hold under this assumption, or do you mean that equality holds in this special case? In any case, this analysis only proves an intuitive conclusion and hence is not absolutely required, in my opinion.

    Experiments:

    • Extensive experiment section but conclusions are very short
    • Which metrics are applied / “what” are the numbers in the tables? It might be standard for the datasets used but it’s still important to have a short description.
    • Missing hyperparameter description: which network architecture and other essential hyperparameters were used? How were hyperparameters tuned or models selected for the many baselines?
    • How was the personalization done? Just the last local model checkpoint/best local checkpoint according to a local validation set/…? Was this also done for FedAvg?
    • How was test-time adaptation performed? Are the upper methods in table 3 adapted from the FedAvg model? Which dataset (split) was used?
    • Did you also try to use the feature alignment losses as regularizers instead of the meta-optimization framework as an ablation study?

    Figures/tables

    • Fig. 1: in general, it gives a good overview, although it is a bit small. Fig. 1(a) could maybe be condensed further, since it mostly describes the default FL setup plus the OOD testing setup. Fig. 1(b) is hard to understand without reading the Methods section.
    • Fig. 2: why are Fig 2 (a) and (b) necessary? Don’t they share the same message?
    • Fig 3 is interesting, but does not provide many new insights, since the ablation study already shows that the aggregation leads to better training results. Maybe something for the appendix.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Details and alignment loss

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Federated learning suffers from domain shift challenges where there are opposing pulls to minimize the shift while simultaneously improving local performance. The authors propose GPFL: “generalized and personalized federated learning” through a novel GRAdient CorrEction (GRACE) method. GRACE incorporates a feature alignment regularization under a meta-learning framework on the client side to correct the personalized gradients from overfitting. Simultaneously, GRACE employs a consistency-enhanced re-weighting aggregation to calibrate the uploaded gradients on the server side for better generalization.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors recognize that in federated learning, particularly when the models are deployed, there is a conflict between global generalization to minimize domain shifts and the need for local performance. Their insight is that local models improve with global models that are generalized to the larger set of federated data, while global models, in turn, improve from feedback from local models. They call these corrections GRACE, for gradient corrections.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The theory is described well for M models, but they test with only TWO (2). So it is unclear how much this theory will generalize. That said, their evaluation is fairly exhaustive, covering a number of SOTA methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper should be fairly reproducible, since the authors state that they have / will provide code. I am not sure I could find that in the paper; maybe I missed it, and it should be more clearly pointed out.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors point out an important problem in federated learning and propose an interesting solution. However, their solution may not stand the resilience test as more datasets are added into the mix. Further, the writing / description of the content in the paper could be made much less convoluted and more lucid. There are also some grammatical errors scattered throughout, making understanding difficult.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Federated learning is an important topic. The authors describe its promise and challenges clearly. They identify key challenges and provide theoretical basis for their approach. However, they have used only two datasets for demonstrating effectiveness. They should have used more to provide strong justification/evidence of success through their method.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces GRACE (GRAdient CorrEction), a framework that addresses the challenge of balancing personalization and generalization in federated learning. The authors propose a method for personalized and domain-generalized federated learning in medical imaging using a combination of client-side and server-side elements. On the client side, meta-learning is used to align local and global models, while on the server side, aggregation weights are computed based on cosine similarities of model parameters. The approach is evaluated on skin lesion classification and prostate segmentation tasks, demonstrating competitive personalization capabilities and superior generalization compared to state-of-the-art methods. GRACE incorporates feature alignment regularization and consistency-enhanced re-weighting aggregation to correct personalized gradients and improve generalization in federated learning.

    Strengths:

    • Novel approach in attempting to balance both personalization and generalization in federated learning.
    • Impressive benchmarking effort, evaluating a significant number of algorithms.
    • Introduction of effective components for model aggregation and local learning, demonstrating consistent improvements in experiments.
    • Use of common and publicly available datasets, enabling thorough comparisons with baselines.
    • Clear structure and well-written content, enhancing readability and understanding.
    • Recognition of the conflict between global generalization and local performance in federated learning, addressed through the GRACE framework for gradient corrections.

    Weaknesses:

    • Heuristic aggregation phase based solely on gradient consistency lacks clarity in theoretical analysis and may not capture the full gradient information.
    • Lack of clear explanation and justification for the choice of evaluation metrics, significance of improvements, and experimental details.
    • Insufficient reporting of network architecture, hyperparameters, model selection, and personalization/test-time adaptation procedures.
    • Limited ablation studies hinder the understanding of the individual contributions of client and server-side corrections.
    • Concerns about scalability due to the computational requirements, particularly in larger federations or models.
    • Lack of justification for the choice of client-global alignment loss in the experiments.
    • Need for more intuitive explanations to motivate design choices, such as the necessity of meta-learning optimization.
    • Limited testing with only two models raises uncertainties about the generalizability of the proposed theory.
    • Despite extensive evaluation with state-of-the-art methods, further investigation is needed to ensure broader applicability.

    Constructive feedback:

    • Provide a more comprehensive theoretical analysis of the heuristic aggregation phase, addressing concerns about capturing full gradient information.
    • Clearly define and explain evaluation metrics, significance of improvements, and experimental details to enhance the transparency and reproducibility of the results.
    • Include detailed information about network architecture, hyperparameters, model selection, and personalization/test-time adaptation procedures to ensure robustness and reliability of the approach.
    • Conduct additional ablation studies to better understand the contributions of client and server-side corrections.
    • Explore more scalable approaches to address computational requirements in larger federations or models.
    • Provide a stronger justification for the choice of client-global alignment loss in the experiments.
    • Offer more intuitive explanations for design choices, such as the need for meta-learning optimization, to enhance the understanding and motivation behind the proposed method.
    • Extend the testing to a wider range of models to establish the generalizability of the proposed theory.
    • Consider further investigation and experimentation to ensure the applicability and effectiveness of the approach beyond the evaluation with state-of-the-art methods.




Author Feedback

We sincerely appreciate the meta-reviewer and all reviewers for their constructive suggestions and feedback provided during the review process.

About open source: We'll release the source code, including training for GRACE and the SOTA methods, hyperparameters, metrics, and the TTDA experiments.

Response to summarized weaknesses and constructive feedback:

-On motivation, theoretical analysis, and full gradient information: Gradient consistency as an effective measure of the non-iid extent has drawn increasing attention in recent works (e.g., FedRoD, Ditto), but in the form of the variance of local gradients. Unlike their use of it for local training, we show that properly utilizing gradient consistency during aggregation promotes faster and more consistent convergence, and thus benefits global performance (Fig. 3). In Eq. (4), we compare the weighted average consistency of FedAvg and GRACE, showing that our method always guarantees gradient consistency no lower than FedAvg's. The equality condition shows that GRACE coincides with FedAvg only when the clients are iid, and is better when they are non-iid. Furthermore, our method employs a soft re-weighting strategy for model aggregation, which preserves the full gradient information from the clients; only the weight given to each client differs during aggregation.
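
For concreteness, the following is a minimal sketch of what such a consistency-based soft re-weighting could look like; it illustrates the idea described above, not the paper's exact Eq. (4) (the softmax weighting and the temperature are assumptions).

```python
import numpy as np

def consistency_reweighted_aggregate(updates, temperature=1.0):
    """Soft re-weighting sketch: every client's update is kept (full
    gradient information preserved), but clients whose updates agree
    more with the rest receive larger aggregation weights."""
    U = np.stack([u / (np.linalg.norm(u) + 1e-12) for u in updates])
    sims = U @ U.T                                     # pairwise cosine similarities
    K = len(updates)
    consistency = (sims.sum(axis=1) - 1.0) / (K - 1)   # mean similarity, self excluded
    w = np.exp(consistency / temperature)
    w /= w.sum()                                       # aggregation weights
    return np.tensordot(w, np.stack(updates), axes=1)  # weighted average update
```

With iid clients, all consistency scores coincide and the weights reduce to the uniform FedAvg average, matching the equality condition stated above.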

-Metric & Details: Model performance is evaluated from the views of both personalization and generalization. We follow the leave-one-domain-out strategy used in DG and IOP-FL. “0-5” & “A-F” in Tables 1-3 denote the results obtained by training with all clients except the left-out one. The generalization metric is the performance of the global model on the left-out domain, which measures the ability to generalize to the unseen domain. The personalization metric is the average performance of the local models on their local validation sets, which measures the ability to personalize to the seen domains. We keep the same training settings as reported in FLamby and IOP-FL, and the weight for the alignment loss is 0.1 for MMD and 1 for the others. We would like to clarify that GRACE achieves SOTA on both global generalization and local personalization, which previous methods cannot balance. More details of the experimental setup will be provided in the final version and fully reflected in our open-source code.
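
The protocol described here can be summarized by the following sketch; `train_fl` and `eval_model` are hypothetical helpers standing in for federated training and metric computation (accuracy or Dice, depending on the task).

```python
def leave_one_domain_out(domains, train_fl, eval_model):
    """Leave-one-domain-out evaluation: for each held-out domain,
    train on the rest, then measure generalization (global model on
    the unseen domain) and personalization (local models on their
    own validation sets, averaged)."""
    results = {}
    for held_out in domains:
        seen = [d for d in domains if d != held_out]
        global_model, local_models = train_fl(seen)
        generalization = eval_model(global_model, held_out)
        personalization = sum(eval_model(local_models[c], c) for c in seen) / len(seen)
        results[held_out] = {"P": personalization, "G": generalization}
    return results
```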

-Ablation & Alignment Loss: Ablations with only server-side or only client-side correction are shown in Table 4. For the alignment loss alone, we add the results of three different alignment losses below. Considering both performance and efficiency, we prefer CORAL.

| Alignment \ Dataset | ISIC (P/G)  | Prostate (P/G) |
| Adv.                | 72.43/51.29 | 93.13/86.46    |
| CORAL               | 75.27/51.73 | 93.04/86.74    |
| MMD                 | 75.05/51.29 | 92.96/86.52    |
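
For reference, the standard CORAL loss aligns the second-order statistics of two feature batches; a textbook sketch is given below (this is the generic formulation, not necessarily the exact variant used in the paper).

```python
import torch

def coral_loss(f_local, f_global):
    """CORAL alignment between two (batch x dim) feature matrices:
    squared Frobenius distance of their covariances, scaled by 4*d^2."""
    def cov(f):
        f = f - f.mean(dim=0, keepdim=True)
        return (f.T @ f) / (f.shape[0] - 1)
    d = f_local.shape[1]
    return ((cov(f_local) - cov(f_global)) ** 2).sum() / (4 * d * d)
```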

-Scalability: Medical scenarios often involve cross-silo FL settings (as in IOP-FL (TMI 2023) and CCST (WACV 2023)), where the client number is typically <100, causing only seconds-long waiting times for GRACE (the cost grows quadratically with the client number). We'll discuss and explore ways to enhance efficiency in the revision.

-Explanation for design: Meta-learning optimization provides a natural way to unify the goals of personalization and global generalization into a two-phase paradigm. The meta-train phase can be seen as a rapid personalization process, and we introduce the alignment loss in the meta-update phase to maintain global generalization. As for the server correction (Sec. 2.3), data quality and quantity are equally important in non-iid medical scenarios. Existing methods relying solely on quantity-based weighting are suboptimal; we account for quality during aggregation by using gradient consistency for re-weighting.
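
A schematic, first-order sketch of this two-phase client update follows; the loss functions, hyperparameters, and helper signatures are illustrative assumptions (e.g., `align_loss` is assumed to compare the adapted model's features against the frozen global model's, as with the CORAL loss above), not the paper's exact implementation.

```python
import copy
import torch

def client_meta_step(model, task_loss, align_loss, batch_tr, batch_val,
                     inner_lr=0.01, align_weight=0.1):
    """Two-phase client update: meta-train = rapid personalization on
    the local task loss; meta-update = task + feature-alignment loss
    evaluated at the adapted weights, 'correcting' the gradient away
    from pure local overfitting (first-order approximation)."""
    # meta-train phase: one inner SGD step on a copy of the model
    fast = copy.deepcopy(model)
    inner = task_loss(fast, batch_tr)
    grads = torch.autograd.grad(inner, list(fast.parameters()))
    with torch.no_grad():
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g
    # meta-update phase: outer loss at the adapted weights, with alignment term
    outer = task_loss(fast, batch_val) + align_weight * align_loss(fast, batch_val)
    outer_grads = torch.autograd.grad(outer, list(fast.parameters()))
    # first-order shortcut: apply the adapted-model gradients to the base model
    for p, g in zip(model.parameters(), outer_grads):
        p.grad = g.clone()
    return float(outer)
```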

-For more models & applications: Currently, we select representative classification and segmentation tasks and use the benchmark-required models; we will consider applying GRACE to a wider range of models and tasks to demonstrate its efficiency and effectiveness in future explorations.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a relatively novel approach to achieving both personalization and generalization in federated learning (FL), which distinguishes it from previous research focused on either aspect. The extensive benchmarking of multiple algorithms adds credibility to the study, while the introduction of two novel FL method components demonstrates consistent improvements in experiments. The use of common datasets, comprehensive comparisons with baselines, and insightful ablation studies provide a strong empirical foundation. The paper’s clear structure, well-written content, and the authors’ understanding of the trade-off between global generalization and local performance further contribute to its strengths. Therefore, I strongly recommend accepting the paper for publication.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a novel solution and achieves promising performance. It is technically sound and does introduce some insight. I would suggest accepting after the authors take the reviewers' suggestions into account.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This manuscript introduces a gradient correction-based technique for non-iid FL and provides an interesting perspective on the subject. This is an interesting work. However, upon my careful reading and consideration of the rebuttal, I have some concerns that lead me towards not recommending this paper.

    The authors’ rebuttal, though appreciated, does not completely address the pivotal concerns related to the correctness of the constraint and the comprehensive theoretical analysis, previously raised by Reviewer 1 and echoed in the meta-review.

    Reviewer 1 insightfully observed that consistency is not the only determinant of convergence. This is an area where the paper would benefit from more in-depth analysis and clarification, which, unfortunately, was not evident in the rebuttal.

    The authors chose to reference two related works, FedRoD and Ditto, to support their argument. While these works are indeed relevant to the field, their alignment with the current work's rationale is questionable, as neither directly addresses gradient divergence by modifying the gradient as proposed here. Instead, they introduce new optimization frameworks or implement constraints as regularization for consensus.

    Beyond the circumstances highlighted by Reviewer 1, I have further concerns about convergence. The gradient similarity in this work is presumably influenced by factors such as the step size and learning rate, aspects which the theoretical analysis and rebuttal do not sufficiently acknowledge.

    In summary, while the paper offers a noteworthy proposition of altering aggregated gradients using similarity, I hold reservations regarding the method’s validity. Future iterations of this work could be significantly strengthened with more rigorous justification of the aggregation step or by testing different local step sizes and learning rates. This would lend more credibility to the proposed method, making it more compelling.


