
Authors

Minghui Chen, Meirui Jiang, Qi Dou, Zehua Wang, Xiaoxiao Li

Abstract

Cross-silo federated learning (FL) enables the development of machine learning models on datasets distributed across data centers such as hospitals and clinical research laboratories. However, recent research has found that current FL algorithms face a trade-off between local and global performance when confronted with distribution shifts. Specifically, personalized FL methods have a tendency to overfit to local data, leading to a sharp valley in the local model and inhibiting its ability to generalize to out-of-distribution data. In this paper, we propose a novel federated model soup method (i.e., selective interpolation of model parameters) to optimize the trade-off between local and global performance. Specifically, during the federated training phase, each client maintains its own global model pool by monitoring the performance of the interpolated model between the local and global models. This allows us to alleviate overfitting and seek flat minima, which can significantly improve the model’s generalization performance. We evaluate our method on retinal and pathological image classification tasks, and our proposed method achieves significant improvements for out-of-distribution generalization.
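For intuition, the client-side selection step described above can be summarized by the following minimal sketch (illustrative Python; function and variable names are ours, not the authors' — see the linked code repository for the actual implementation):

    import copy
    import torch

    def average_state_dicts(state_dicts):
        # Uniformly average a list of model state_dicts (the "soup").
        avg = copy.deepcopy(state_dicts[0])
        for key in avg:
            avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        return avg

    def update_soup(local_model, global_state, soup, val_accuracy, val_loader):
        # Greedy, validation-gated soup growth on one client: the incoming global
        # weights join this client's soup only if averaging them with the current
        # local weights (and previously kept global models) does not hurt local
        # validation accuracy.
        local_state = local_model.state_dict()
        candidate = average_state_dicts([local_state, global_state] + soup)
        if val_accuracy(candidate, val_loader) >= val_accuracy(local_state, val_loader):
            soup.append(copy.deepcopy(global_state))
        return soup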

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_30

SharedIt: https://rdcu.be/dnwyJ

Link to the code repository

https://github.com/ubc-tea/FedSoup

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    Averaging weights of neural networks is an interesting idea with several empirical benefits. In federated learning, it is typically used by defining the weights of a ‘global’ model as the average of the weight vectors of all ‘local’ models. In this paper, the authors additionally use weight averaging to define the local models themselves. This is done by averaging, for each local client, the local model weights with the global model weights, provided that the averaged model performs better on the local validation set than the local model without averaging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting methodological novelty in extending the notion of model averaging for wide minima in the context of federated learning.

    • Experiments and analyses of results are well done, and nicely presented.

    • Good paper writing and organization.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Description of related work is missing. Due to this, it is unclear what aspect of the proposed method leads to better performance over other methods.

    • Some details of the algorithm are missing / not clearly specified.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors have agreed to make their code publicly available upon acceptance. Publicly available datasets are used for validating the proposed method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. In Algorithm 1, the soup is initialized as an empty set, and standard federated learning is carried out until epoch E. Can you please explain in which situations the condition in line 7 will be satisfied? When the soup is empty, line 7 compares, for client l, the validation accuracy of the model with weights theta_l against that of the model with weights average(theta_g, theta_l). Since theta_l has just been updated in line 5 on client l’s samples, one would expect these weights to give better validation accuracy than the weights averaged with the global model. Indeed, the authors themselves say that “simply integrating a global model can damage the model’s personalization”, which implies that integrating the global model would decrease the local client’s validation accuracy. In practice, how often is the condition in line 7 satisfied (i.e., the ratio of the number of times it is satisfied to the number of times it is evaluated)? A small instrumentation sketch of this check is included after these comments. Further, as long as the soup is an empty set, line 9 (model patching) will have no effect.

    2. Please clarify how exactly line 4 of algorithm 1, ‘Aggregation’ is implemented? Are the weights of all local models averaged in this step?

    3. Most of the arguments made in Section 2.2 are taken from [3]. This is not at all clear upon reading the paper. The section is titled “Trade-off analysis”, which suggests that the analysis is a contribution of the current paper. Please make it clear that this is not the case and give due credit to [3]. In particular, the arguments in the paragraph above Equation 2 are not very convincing. For instance, even if the loss landscapes corresponding to different data distributions have sharp minima, it is not obvious why one of these sharp minima should lie in the epsilon-neighborhood of the other. Also, it seems that the proof of Theorem 1 in [3] does not rely on such arguments. Please either justify these arguments or remove them from the paper. Also, please clearly point out which points in Section 2.2 are contributions of this paper (e.g., the connection to federated learning).

    4. Based on the point above, I would suggest making Section 2.2 shorter. The saved space should be used to provide a description of related work, which is completely missing in the current version. In particular, please explain the main ideas of the methods compared against in Table 1, and relate those ideas to the idea of this paper. Without this description, it is unclear what one can make of the quantitative results. If possible, use these connections to provide intuitions as to why the proposed method performs better than other methods.

    5. Please improve the clarity of Figure 1. What do the dots between the blue circles indicate in the left sub-figure and the right sub-figure? The word ‘patching’ is misspelled in the figure.
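    Regarding point 1, for concreteness, here is a small instrumentation sketch of the check I have in mind (all names are mine, not the paper’s); reporting this ratio over training would answer the question directly:

      evaluated, satisfied = 0, 0

      def check_line7(theta_l, theta_g, soup, val_acc):
          # val_acc is assumed to map a weight vector to local validation accuracy;
          # theta_l, theta_g and the soup entries are assumed to be flat weight
          # vectors (e.g. numpy arrays) so that they can be averaged directly.
          global evaluated, satisfied
          evaluated += 1
          candidate = sum([theta_l, theta_g] + soup) / (2 + len(soup))  # uniform average
          if val_acc(candidate) >= val_acc(theta_l):
              satisfied += 1
              soup.append(theta_g)
          return satisfied / evaluated  # ratio of times the condition is satisfied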

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The application of wide minima for improving model generalization is novel, as far as I know. Further, the experiments and analyses are well done.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    My main concern in the initial review was the lack of clarity regarding implementation details. Authors have promised to improve this aspect of the paper, but it is hard to judge this improvement without seeing it. Therefore, I retain my initial assessment of the paper.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a federated learning framework based on model soups to mitigate the tradeoff between personalization and generalization for medical image classification. The proposed approach involves maintaining a pool of global models in each client (temporal model selection) and interpolating the weights between the local and global models (model patching).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper offers a good analysis of the problem, with a motivating example using the sharpness of the loss landscape to explain the underlying mechanism. The proposed approach seems to improve over the SoTA while requiring almost no computation overhead.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is not clear what the novel contributions of the paper are. The use of model soups was already proposed in [28], and its application to federated learning is straightforward. Similarly, model patching was proposed in [10], and applying it to the local training phase of federated learning is a direct application. The novelty and impact of the paper therefore seem limited and need more elaboration. The impact of each of the two components has not been evaluated separately, and the model soup component alone is probably sufficient to achieve the reported results. The improvement over the SoTA, once variance is taken into account, is negligible and comparable to previous results. In particular, it is not clear whether the proposed approach aims to improve generalization or personalization; some existing approaches outperform it on one or the other independently. Finally, while the proposed approach has limited computation overhead, its memory overhead is similar to SWA, since each local node needs to keep a large set of models in memory for the soup. This overhead is not addressed.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is adequate. Some important parameters, such as the model soup size (and its impact on performance), are not disclosed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1. Could you elaborate on the novelty and added value compared to the original implementations of model soups and model patching?
    2. Could you elaborate on why both components are needed and on their respective contributions to the overall results?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution of the paper appears to reuse two existing mechanisms without any specific adaptation to the federated learning setting. Depending on the objective (generalization or personalization), there are previous techniques that outperform it.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Dear authors, thank you very much for your clarifications and explanations of the benefits of your approach compared to a straightforward use of model soups and model patching.

    My second concern was about the ablation study comparing the full FedSoup against FedSoup without model patching and FedSoup without the modified soup. Thank you for the new results; I believe the paper could be accepted if this ablation is provided in the final version.

    My last concern was about the trade-off analysis of the approach. I disagree with the authors’ answer that “Table 1 shows that our method outperforms other methods in terms of the trade-off of local and global performance”. Studying a Pareto trade-off over two objectives requires rigorous, principled evaluation, not just reporting the values of the two objectives as in Table 1. The multi-objective literature includes many metrics, such as the hypervolume, the OS metric, the spacing metric, and the overall Pareto spread metric, among others (a small illustration of the hypervolume computation is sketched at the end of this response). More practically, the use cases in which one wants to maximize the local and the global performance together are not motivated. In the real applications described in the paper, the local performance matters more to clinicians, and in that case the FedProx approach seems better on the Pathology dataset. In other settings, such as sharing or distributing models, the global performance may matter more. I would like the accepted paper to describe, in a few sentences, a practical case in which both objectives need to be achieved.

    Overall, I am increasing my score to weak accept, provided the clarifications above are included in the final version.
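    As mentioned above, to make the hypervolume suggestion concrete: the 2D hypervolume is the area a method’s (local AUC, global AUC) points dominate relative to a common reference point, and can be computed as sketched below (purely illustrative; the numbers in the usage lines are placeholders, not values from the paper).

      def hypervolume_2d(points, reference):
          # Area dominated by a set of (local_auc, global_auc) points, both to be
          # maximized, relative to a reference (worst-case) point below all of them.
          pts = sorted(points, key=lambda p: p[0], reverse=True)
          front, best_g = [], float("-inf")
          for local, glob in pts:                    # keep only non-dominated points
              if glob > best_g:
                  front.append((local, glob))
                  best_g = glob
          area, prev_local = 0.0, reference[0]       # sweep the front left to right
          for local, glob in sorted(front, key=lambda p: p[0]):
              area += (local - prev_local) * (glob - reference[1])
              prev_local = local
          return area

      # Placeholder usage (NOT the paper's numbers): compare two methods' trade-off points.
      print(hypervolume_2d([(0.85, 0.84)], (0.5, 0.5)))
      print(hypervolume_2d([(0.86, 0.80)], (0.5, 0.5)))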



Review #3

  • Please describe the contribution of the paper

    This paper studies the trade-off between local and global performance in FL, which is a noteworthy problem in FL. The authors analyze the trade-off from a new perspective of the sharp valley of loss landscape. A temporal model selection method is proposed to select temporal history models for combination, and the combined global model will be adapted to each local client by model patching. Analytic experiments have been conducted to show the performance trade-off at different personalized levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The reviewer appreciates the new perspective of analyzing the trade-off between global and local performance in FL. The presentation is good, and the paper is easy to follow. Experimental results are presented clearly. The effect of sharpness measure and personalized level are further analyzed. Additional results on unseen domains are delightful.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The global loss would not decrease smoothly, since the global model is updated by model aggregation. Therefore, the example in Fig. 1 may not reflect the actual behavior of FL. If FL approaches the minima with fluctuations, the temporal model selection would be affected.
    2. Model patching is applied to enhance the personalization of the local model. Why not use other personalization techniques? What are the benefits of model patching compared with the others?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this paper is relatively high because most implementation details, such as batch size, learning rate, and optimizer parameters, are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    As mentioned in the weaknesses section, please verify whether a fluctuating loss curve affects the model selection. If YES, are there any better solutions? Moreover, more explanation is needed to support that the adopted methods are optimal.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of this paper exceeds expectations: it analyzes the trade-off between global and local performance through the sharp valley of the loss landscape. The paper is well organized, and the experiments are sufficient, including comparisons with SOTA methods and a robustness analysis.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    This paper aims to achieve a trade-off between generalization and personalization in FL. The paper’s novelty is acceptable, as explained in the authors’ responses, but the authors should further clarify their additional contributions compared with existing model patching and model soups methods.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Summary: The paper introduces an innovative approach to weight averaging in federated learning, applied not only to the global model but also to the local models themselves. The proposed federated learning framework based on model soups aims to address the trade-off between personalization and generalization in medical image classification. It involves maintaining a pool of global models in each client and interpolating weights between local and global models through model patching. The study focuses on the trade-off between local and global performance in federated learning, analyzing it from the perspective of the sharp valley of the loss landscape. The authors propose a temporal model selection method and conduct analytical experiments to demonstrate the performance trade-off at different personalization levels.

    Strengths:

    • Methodological novelty in extending model averaging for wide minima in federated learning.
    • Well-executed experiments and analysis with clear presentation.
    • Effective paper writing and organization.

    Weaknesses:

    • Lack of description and discussion of related work, making it unclear how the proposed method improves upon existing approaches.
    • Missing or unclear details of the algorithm, which hampers understanding and reproducibility.
    • Limited novelty and impact of the paper, as the concepts of model soups and model patching have been proposed previously and their application to federated learning is straightforward.
    • Insufficient evaluation of the individual components of the approach, and the impact of each component is not separately assessed.
    • Negligible improvements compared to state-of-the-art methods, and it is unclear whether the proposed approach aims to improve generalization or personalization.
    • Memory overhead due to the storage of multiple models in the soup is not addressed or discussed.
    • Concerns about the simulation of fluctuations in the global loss and the potential impact on temporal model selection.
    • Lack of comparison with other personalization techniques and a need to clarify the benefits of model patching compared to alternative methods.




Author Feedback

We appreciate the valuable comments provided by the reviewers. We are thankful that all the reviewers liked our clear writing and R1&R3 recognized the novelty of our approach. We address the concerns raised by the reviewers below.

Novelty [R2]. Our key innovations over Model Soup [1] and model patching [2] are severalfold. First, we argue that incorporating the key idea of [1] into FL is NOT straightforward. As stated in Sec 1, [1] requires training a large number of models from scratch with different hyperparameters, which is impractical in FL. Therefore, we introduce Temporal Model Selection, “which maintains a client specific model soups with temporal history global model … not incurring additional training costs”. To this end, we propose a novel selection indicator based on local validation accuracy (see Sec 2.3). Although motivated by [1], our new design adds significant practical value in FL. Second, our method is not a direct application of [2] to FL. We highlight that the motivation for adopting [2] in FedSoup is to address the limitations of temporal global model selection (see Sec 2.3). We adapt model patching by interpolating the local model with the global model pool to balance the local/global performance trade-off, rather than simply averaging with a single global model as a direct use of [2] would. Finally, our contribution lies in providing a new perspective for analyzing the trade-off between global and local performance via the FL loss landscape, as acknowledged by R1&R3.
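A simplified sketch of the adapted patching step (illustrative Python; the function name and the interpolation coefficient alpha below are for exposition only and do not correspond to our released implementation):

    import copy
    import torch

    def patch_with_soup(local_state, soup_states, alpha=0.75):
        # Interpolate the local weights with the average of the client's pool of
        # historical global models, instead of with a single global model as a
        # direct use of [2] would do. alpha balances personalization (local) and
        # generalization (global).
        patched = copy.deepcopy(local_state)
        for key in patched:
            soup_avg = torch.stack([sd[key].float() for sd in soup_states]).mean(dim=0)
            patched[key] = alpha * patched[key].float() + (1.0 - alpha) * soup_avg
        return patched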

Justification of the Proposed Modules [R1,2,3]. We emphasize that the modified model soup and model patching are interdependent modules in our proposed FedSoup. On one hand, model patching is built on our modified model soup algorithm, which offers abundant models for exploring flatter minima; without the model soup, we only have a single local model and a single global model to interpolate, leading to unsatisfactory results. On the other hand, the model soup itself serves generalization, and we stated that using our modified model soup alone hurts model personalization (Sec 2.3). To support these statements, we ran experiments with a) model patching only and b) the modified model soup only. Compared with full FedSoup, Retina AUC drops by 2.1% (local) and 2.9% (global) for a), and by 3.6% (local) and 0.9% (global) for b). [R3, other personalization methods] Our method is orthogonal to other local personalization methods and can be integrated with many of them for further improvement. We explored a combination with FedProx, yielding 1.2% (local) and 0.9% (global) increases in Retina AUC compared with using the FedAvg backbone.

Clarification of Results [R2]. R2 may have missed the objective of our work, improving the trade-off between local (personalization) and global (generalization) performance, which may have led to a misinterpretation of our results. We are pleased to see that this point came across well to R1 & R3. We claim that the improvement achieved by FedSoup is not incremental: Table 1 shows that our method outperforms other methods in terms of the trade-off between local and global performance. We also conducted a two-sample t-test with 50 runs against the best baseline, FedBABU, on the Retina dataset, showing a significant increase in AUC (p < 1e-5) for both local and global performance. Although our local AUC is slightly lower (-0.31) than FedProx on Pathology, our global AUC surpasses it by a significant margin (+4.22).

Additional Details. Memory [R2]: the saved models can be offloaded to the CPU during training to reduce GPU memory usage. Model Selection in Soup [R1]: we can adjust the starting stage of model selection and relax the threshold to increase the number of averaged models and avoid an empty soup. Writing [R1]: following your helpful suggestions, we will shorten Sec 2.2 and use the space for more implementation details. Loss Fluctuation [R3]: we employed learning rate decay to mitigate the influence of loss fluctuation.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    We are delighted to accept the submitted paper after careful consideration and fruitful discussions among the reviewers. This work showcases a commendable level of rigor, novelty, and significance in its findings, and it contributes substantially to the existing body of knowledge in the field.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a novel federated model soup method to optimize the trade-off between local and global performance. The paper’s novelty is acceptable. One major concern is the clinical value: the whole paper focuses on federated learning, which makes it read like a machine learning work that is merely tested on medical datasets. More clinical motivation should be provided.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers agree to accept this paper after rebuttal.


