
Authors

Zhipeng Deng, Luyang Luo, Hao Chen

Abstract

Federated learning (FL) has been introduced to the healthcare domain as a decentralized learning paradigm that allows multiple parties to train a model collaboratively without privacy leakage. However, most previous studies have assumed that every client holds an identical label set. In reality, medical specialists tend to annotate only diseases within their knowledge domain or interest. This implies that the label sets of different clients can differ and even be disjoint. In this paper, we propose FedLSM, a framework that solves the Label Set Mismatch problem. FedLSM adopts different training strategies on data with different uncertainty levels to efficiently utilize unlabeled or partially labeled data, as well as class-wise adaptive aggregation in the classification layer to avoid inaccurate aggregation when clients have missing labels. We evaluate FedLSM on two public real-world medical image datasets, including chest X-ray (CXR) diagnosis with 112,120 CXR images and skin lesion diagnosis with 10,015 dermoscopy images, and show that it significantly outperforms other state-of-the-art FL algorithms. Code will be made available upon acceptance.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_12

SharedIt: https://rdcu.be/dnwAJ

Link to the code repository

https://github.com/dzp2095/FedLSM

Link to the dataset(s)

https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community

https://challenge.isic-archive.com/landing/2018/


Reviews

Review #2

  • Please describe the contribution of the paper

    A federated learning framework is proposed to address the problem of label set mismatch. Specifically, the authors propose to use uncertainty estimation to split the training data, and use pseudo labeling and MixUp on the different splits at the client side. At the server side, they propose an adaptive weighted averaging algorithm. The proposed method is evaluated on two large public datasets comprising chest X-ray images and dermoscopy images.
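
    The client-side pipeline summarized above lends itself to a short illustration: score each unlabeled sample's uncertainty, pseudo-label the confident pool, and MixUp confident with uncertain samples. The sketch below is a minimal rendering of that idea; the entropy-based score, thresholds, and function names are our assumptions, not details taken from the paper.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive entropy of class probabilities as an uncertainty score."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def split_by_uncertainty(probs, tau_low=0.3, tau_high=1.5):
    """Partition unlabeled samples into confident and uncertain pools."""
    u = entropy(probs)
    confident = u < tau_low                       # pseudo-labeled directly
    uncertain = (u >= tau_low) & (u <= tau_high)  # retained for MixUp
    return confident, uncertain

def mixup(x_conf, y_conf, x_unc, y_unc, alpha=0.75):
    """MixUp a confident sample with an uncertain one."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # bias the mix toward the confident sample
    return lam * x_conf + (1 - lam) * x_unc, lam * y_conf + (1 - lam) * y_unc
```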

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea is novel in terms of addressing the label mismatch problem. 2) The methodological part, specifically the MixUp of data with high certainty and low certainty, is interesting. 3) The experimental comparisons and ablation studies are thorough.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The paper does not quantify the label mismatch problem: how severe is it, and to what extent does the proposed method resolve it? 2) The use of FedAvg with 100% labeled data as an upper bound needs further justification. Why not other methods?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper looks okay.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) Provide some details about the label mismatch problem; for example, how many labels of class a exist in client x, how many labels of class b in client y, etc. 2) Explain the use of FedAvg with 100% labeled data as an upper bound.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Addressing label mismatch in medical imaging datasets for federated learning is an important challenge to tackle. Though minor problems exist, the methodology and experiments of this paper are of good quality.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper tries to solve the label set mismatch problem in federated learning. The main contributions of the paper are: 1. it proposes uncertainty estimation to alleviate problems resulting from incorrect pseudo labels; 2. it generates MixUp samples to utilize filtered low-certainty missing-label data; 3. it introduces an adaptive classification-layer aggregation method for FL averaging.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This paper aims to solve a common yet overlooked issue, the label set mismatch problem. A simple yet efficient method is also proposed.
    (2) An efficient class-wise adaptive aggregation in the classification layer is proposed to avoid inaccurate aggregation when clients have missing labels.
    (3) Extensive experiments with necessary ablation studies demonstrate the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Many typos. For example, under Eq. (2), x_i and x^i are both used; and in the subsection Uncertain Data Enhancing (UDE), x^l sometimes belongs to the confident data D_k^l and sometimes denotes uncertain data. Such expression is not rigorous. (2) The motivation of Adaptive Weighted Proxy Aggregation (AWPA) is not clear. (3) Uploading the EDD to the server may introduce extra privacy exposure.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim the code will be made available upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) More rigorous equations and formulations are needed. (2) The motivation for AWPA should be stated. (3) Fig. 2(a) should be refined to clarify how complete-label data and missing-label data are handled.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The task itself is interesting and not well explored in the literature. The uncertainty estimation module is novel and necessary in semi-supervised learning. However, the paper would benefit from being more rigorous in several equations, and there is currently a lack of sufficient motivation for some of the modules.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper addresses the label set mismatch problem in federated learning and introduces several contributions. Firstly, it proposes uncertainty estimation to handle incorrect pseudo labels. Secondly, it utilizes mixup samples to leverage low-certainty missing label data. Lastly, an adaptive classification-layer aggregation method for federated learning averaging is introduced. The method is evaluated on two large public datasets, including chest X-ray and dermoscopy images.

    Strengths:

    1. The paper addresses the commonly overlooked label set mismatch problem and proposes a simple yet efficient method to tackle it.
    2. It introduces an efficient class-wise adaptive aggregation approach in the classification layer, which prevents inaccurate aggregation when clients have missing labels.
    3. The paper provides extensive experiments and includes necessary ablation studies to demonstrate the effectiveness of the proposed method.

    Weaknesses:

    1. There are numerous typos throughout the paper, including inconsistent usage of notation, which affects the clarity and rigor of the presentation.
    2. The motivation behind the Adaptive Weighted Proxy Aggregation (AWPA) method is not clearly explained, leaving readers uncertain about its purpose and significance in the proposed approach.
    3. The paper does not adequately address the potential privacy concerns introduced by uploading the Estimated Data Distribution (EDD) to the server, which could lead to additional privacy exposure.




Author Feedback

We thank the meta-reviewer and the reviewers for their comments and constructive suggestions on our manuscript. We are committed to revising our manuscript according to the suggestions, which we believe will significantly improve the quality and clarity of our work.

  1. Clarity and Rigor of Presentation [R1] We acknowledge the typos and inconsistent notations highlighted by the reviewer. We will correct these in the revised version of our manuscript.

  2. Motivation behind AWPA [R1] The inspiration for Adaptive Weighted Proxy Aggregation (AWPA) is rooted in the challenges presented by missing labels, as similarly noted in both FedRS [10] and FPSL [2]. Such missing labels inevitably cause the model to learn inaccurate proxies within the classification layer. Therefore, AWPA adjusts the weights of different proxies in the classification layer during model aggregation. This adjustment is based on each proxy's actual contribution: more weight is assigned to a client's proxy for a class if that class's label (or pseudo label) appears more frequently during the client's local training.
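
A minimal sketch of class-wise weighted proxy aggregation consistent with the description above; the array shapes, names, and normalization are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def aggregate_proxies(client_proxies, class_counts, eps=1e-12):
    """Class-wise weighted averaging of classification-layer proxies.

    client_proxies: list of K arrays of shape (C, d), one classifier
                    weight matrix per client.
    class_counts:   (K, C) array of how often each class appeared as a
                    label or pseudo label during local training.
    """
    P = np.stack(client_proxies)                         # (K, C, d)
    W = class_counts / (class_counts.sum(axis=0) + eps)  # normalize per class
    return np.einsum('kc,kcd->cd', W, P)                 # (C, d)
```

Under such a weighting, a class that a client never observed (no labels or pseudo labels) contributes almost nothing to the aggregated proxy of that class, which is exactly the inaccurate-proxy failure mode described above.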

  3. Privacy Concerns [R1] We understand and appreciate the reviewer's concerns regarding potential privacy exposure associated with uploading the Estimated Data Distribution (EDD) to the server. We wish to clarify that the EDD is purely a statistical estimate and does not contain any specific patient data. Consequently, we believe the risk of privacy leakage from sharing the EDD is minimal, and it does not violate data regulations.
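
For illustration, a hypothetical sketch of the kind of statistic involved (assuming single-label classes for simplicity; the function name and normalization are ours): only a C-dimensional vector of aggregate class frequencies leaves the client.

```python
import numpy as np

def estimated_data_distribution(pseudo_labels, num_classes):
    """Per-class frequency vector shared with the server.

    pseudo_labels: 1-D array of non-negative class indices (labels or
    pseudo labels). Only aggregate counts are returned; no images or
    per-patient records are involved.
    """
    counts = np.bincount(pseudo_labels, minlength=num_classes)
    return counts / max(counts.sum(), 1)
```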

  4. Further Details on Label Mismatch Problem [R2] We acknowledge the request for additional details on the label mismatch problem. In Section 3.2 of our paper, we offer an overview of the federated learning (FL) setup used in our experiments. Specifically, for each client, we labeled 3 random classes and left the remaining 11 unlabeled for task 1; for task 2, we labeled 3 random classes and left the remaining 4 unlabeled. Moreover, the detailed numbers of labeled and unlabeled data for each class can be found in the supplementary materials.

The label set mismatch problem is common in practice. For example, 14 diseases are labeled in the NIH CXR dataset, while over 100 diseases are labeled in the PadChest dataset. Such inconsistencies pose challenges to developing cross-source diagnostic models, especially under the federated learning scenario.
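
To make the experimental setup above concrete, here is a hypothetical sketch of how such a per-client label-set partition could be simulated for task 1 (14 classes, 3 labeled per client); the -1 "unknown" encoding and the function name are our assumptions:

```python
import numpy as np

def mask_client_labels(labels, num_classes=14, n_labeled=3, seed=0):
    """Simulate label set mismatch for one client (multi-label setting).

    labels: (N, num_classes) binary matrix; entries for classes outside
    the client's label set are replaced with -1 ("unknown").
    """
    rng = np.random.default_rng(seed)
    kept = rng.choice(num_classes, size=n_labeled, replace=False)
    keep_mask = np.zeros(num_classes, dtype=bool)
    keep_mask[kept] = True
    return np.where(keep_mask, labels, -1), kept
```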

  5. FedAvg with 100% labeled data as an upper bound [R2] We want to clarify that our intention was to provide a reference point for comparison, rather than asserting that this is the maximum possible achievable performance. FedAvg with 100% labeled data represents an ideal yet often unattainable scenario in real-world applications, which helps highlight the performance of our approach when label set mismatch is present. We appreciate this comment and, in light of this feedback, will revise our use of the term 'upper bound' to avoid any potential misunderstanding.


