
Authors

Faris Almalik, Naif Alkhunaizi, Ibrahim Almakky, Karthik Nandakumar

Abstract

Data scarcity is a significant obstacle hindering the learning of powerful machine learning models in critical healthcare applications. Data-sharing mechanisms among multiple entities (e.g., hospitals) can accelerate model training and yield more accurate predictions. Recently, approaches such as Federated Learning (FL) and Split Learning (SL) have facilitated collaboration without the need to exchange private data. In this work, we propose a framework for medical imaging classification tasks called Federated Split learning of Vision transformer with Block Sampling (FeSViBS). The FeSViBS framework builds upon the existing federated split vision transformer and introduces a block sampling module, which leverages intermediate features extracted by the Vision Transformer (ViT) at the server. This is achieved by sampling features (patch tokens) from an intermediate transformer block and distilling their information content into a pseudo class token before passing them back to the client. These pseudo class tokens serve as an effective feature augmentation strategy and enhance the generalizability of the learned model. We demonstrate the utility of our proposed method compared to other SL and FL approaches on three publicly available medical imaging datasets: HAM10000, BloodMNIST, and Fed-ISIC2019, under both IID and non-IID settings. Code: https://github.com/faresmalik/FeSViBS
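The block sampling idea described above can be illustrated with a small sketch. This is not the authors' implementation: the uniform choice over blocks 1-6 and the mean-pooling stand-in for the learned shared projection network are assumptions for illustration only.

```python
import random

def sample_pseudo_cls(block_features, low=1, high=6, rng=None):
    """Sample one intermediate ViT block uniformly from [low, high] and
    distill its patch tokens into a single pseudo class token.

    block_features: dict mapping block index -> list of patch-token
    vectors produced by the server-side ViT body.
    """
    rng = rng or random.Random(0)
    b = rng.randint(low, high)            # stochastic block choice
    tokens = block_features[b]            # patch tokens of that block
    dim = len(tokens[0])
    # Mean-pooling stands in for the learned shared projection network
    # that distills patch tokens into the pseudo class token.
    pseudo_cls = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return b, pseudo_cls

# Toy features: 12 blocks, 4 patch tokens each, 8-dim; block i's tokens
# are filled with the value i so the pooled token is easy to check.
feats = {i: [[float(i)] * 8 for _ in range(4)] for i in range(1, 13)}
blk, cls_tok = sample_pseudo_cls(feats)   # blk in [1, 6]; cls_tok == [blk] * 8
```

The stochastic choice of block is what makes this a feature augmentation: across rounds, the client's tail sees pseudo class tokens distilled from different depths of the ViT body.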

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_33

SharedIt: https://rdcu.be/dnwyM

Link to the code repository

https://github.com/faresmalik/FeSViBS

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a federated split learning method combined with a novel block sampling module to leverage intermediate features extracted from a Vision Transformer (ViT) running at the server. The proposed framework enhances the generalizability of the learned model by introducing pseudo class tokens as an effective feature augmentation strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors demonstrate the utility of their proposed method compared to other FL and split learning approaches on three publicly available medical imaging datasets under both IID and non-IID settings.
    2. The experiments and ablation studies, including several baseline comparisons on three datasets (both IID and non-IID), are promising and pretty convincingly show the generalizability of the method.
    3. The paper is well-written and organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some implementation details of the baseline, like for FedAvg and local/centralized training, are missing.
    2. Performance-wise, the split learning setting shouldn’t be much different from running the same model in a local environment, as the same operations and computations are performed in both settings; only intermediate results are exchanged in the distributed scenario. Hence, what is meant by the “local” baseline should be clarified. Is it a different model?
    3. The paper explores only label non-IID. Domain shifts when client data stems from different data sources, as often occur in real FL scenarios, have not been explored.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methods and implementation details are clearly described. Authors promise to make code available later.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The term “cls token” should be explained when first introduced. Baselines should be better described as mentioned above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is an interesting and well-presented work with good potential for real-world distributed learning applications. The method is clearly described and the experimental results are convincing, albeit missing some minor details.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors provided a strong rebuttal that clarified some points, e.g., confirming that the data already include multi-center domain shifts. If these explanations are integrated into the final version, it will make a strong paper.



Review #2

  • Please describe the contribution of the paper

    The paper approaches the problem of Federated Split Learning by the proposed FeSViBS. The key contribution of the proposed method is the block sampling module. The author claims that this block sampling module in ViT brings two key benefits: 1. It can utilize the information existing in intermediate ViT features, which is ignored in previous works. 2. It is a feature augmentation to enhance the network’s generalizability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The topic is interesting and clinically significant.

    2. The paper is well-organized and easy to follow.

    3. The idea of leveraging intermediate features of ViT seems effective and interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation for proposing the block sampling module is unclear. This strategy seems related to stochastic depth networks or block dropout in Transformers, which were proposed for efficient learning and faster network convergence.

    2. Updating parameters in the head, tail, and shared projection network after every backward pass may bring significant communication overhead and make the proposed method impractical in real-world deployment.

    3. Authors target the data heterogeneity issue in FL but did not conduct experiments against SOTA methods in this direction (e.g., SCAFFOLD and Adaptive Federated Optimization).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors mentioned that the code will be made available upon request. The availability of the pre-trained model is unclear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Authors may elaborate more on the motivation of the proposed block sampling strategy and clarify why it can bring better generalizability to the ViT model.

    2. Authors may compare the resource usage of different methods. Performance gains alone are not enough for real-world distributed ML. To show the practicality of the proposed method in a real-world setting, authors may provide more details about resource usage (communication/computation cost).

    3. There is no mention of a validation strategy that would have guided the selection of the many hyperparameters. It seems that no proper tuning opportunity was given to the proposed method and that the selection was purely heuristic.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The rating is based on the good performance of the proposed approach in federated split learning. As stated in previous sections, authors should resolve several concerns in Q6 and Q9.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors’ rebuttal addresses my questions; I would like to increase my rating.



Review #3

  • Please describe the contribution of the paper

    Novelty: Introduces a block sampling module, which leverages intermediate features extracted by the Vision Transformer (ViT) at the server. This acts as a feature augmentation strategy for better generalisation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Introduces a block sampling module, which leverages intermediate features extracted by the Vision Transformer (ViT) at the server. This acts as a feature augmentation strategy for better generalisation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    They implement FeSViBS with the first 6 ViT blocks, which needs more explanation. Although some explanation is given with a figure, it is not sufficient justification, and the choice of 6 blocks seems arbitrary.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not very satisfied

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Authors should provide more explanation of the implementation of FeSViBS. The rationale behind the choice of 6 blocks needs to be added.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-written and the algorithm is explained properly. A proper comparison with other methods was done and properly explained.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Summary: The paper introduces FeSViBS, a federated split learning method that incorporates a novel block sampling module to leverage intermediate features extracted from a Vision Transformer (ViT) at the server. The proposed framework improves model generalizability through effective feature augmentation using pseudo class tokens.

    Strengths:

    • Utility and effectiveness demonstrated through experiments and ablation studies on multiple medical imaging datasets, considering both IID and non-IID settings.
    • Well-written and organized paper, facilitating comprehension of the proposed approach and results.
    • Introduction of the block sampling module, utilizing intermediate features from ViT as an effective feature augmentation strategy for improved generalization.

    Weaknesses:

    • Lack of implementation details for important baselines (e.g., FedAvg, local/centralized training), hindering the understanding and reproducibility of the comparisons.
    • Unclear distinction between split learning and local model execution, causing confusion regarding the “local” baseline and its purpose.
    • Failure to explore domain shifts arising from diverse client data sources, a prevalent aspect in real-world federated learning scenarios.
    • Unclear motivation for the block sampling module, which resembles existing strategies without a clear rationale for its purpose.
    • Potential communication overhead due to frequent parameter updates in multiple network components, raising concerns about practicality in real-world deployments.
    • Absence of comparisons with state-of-the-art methods addressing data heterogeneity in federated learning, such as SCAFFOLD and Adaptive Federated Optimization.
    • Inadequate explanation and justification for utilizing the first 6 ViT blocks in FeSViBS, leaving the selection of this specific number arbitrary.

    Constructive Feedback:

    1. Clarify the term “cls token” when it is first introduced to ensure readers have a clear understanding of its meaning and role in the proposed method.
    2. Provide better descriptions of the baselines, as mentioned previously, to enhance understanding and facilitate reproducibility.
    3. Elaborate on the motivation behind the block sampling strategy and its potential to improve the generalizability of the Vision Transformer (ViT) model.
    4. Compare the resource usage (communication/computation cost) of different methods to demonstrate the practicality of the proposed approach in real-world distributed machine learning scenarios.
    5. Discuss the validation strategy used for selecting hyperparameters and provide more details on the tuning process for the proposed method, as it currently appears heuristic.
    6. Provide additional explanation for the implementation of FeSViBS, particularly regarding the rationale behind choosing 6 blocks from the Vision Transformer (ViT) model.




Author Feedback

We thank reviewers (R1, R2, R3) and the meta-reviewer (MR) for their valuable comments.

Baseline Details [R1, R2, MR], SL and Local Model [R1, MR]: All methods in Table 1 use the same hybrid ViT (h-ViT) architecture: ResNet-50 (head), ViT-Base (body), and a linear classifier (tail). For SViBS and FeSViBS, a shared projection network is added. The difference lies in how the model is trained. Centralized: a global h-ViT model is trained by pooling data from all clients in one location. Local: each client trains its own h-ViT model using its local data only. Split Learning (SL): in SLViT and SViBS, each client has its own head and tail, but the body resides on the server and is shared. Thus, the head and tail are trained using only the local data of that client, while the body is updated by all clients. Federated Learning (FL): each client updates a global h-ViT model using its local data, and the server aggregates the client updates (via FedAvg/FedProx). Federated Split Learning: FeSTA and FeSViBS are like SL, except that the local heads and tails are aggregated using FedAvg in each unifying round.

Domain shifts [R1, MR]: The Fed-ISIC2019 dataset contains data collected from 6 centers with large differences in population characteristics and acquisition systems. Since we treat each of these centers as an individual client, this dataset already represents real-world domain shifts. The superior results on Fed-ISIC2019 in Table 1 demonstrate the efficacy of FeSViBS in handling domain shifts.

Motivation for Block Sampling [R2, MR]: The primary motivation for block sampling is to effectively leverage intermediate ViT features, which are better at capturing local texture information (but are lost when only the final cls token is used). Stochasticity in the block selection also serves as a feature augmentation strategy, thereby aiding generalization performance. While this second goal could possibly also be achieved through stochastic depth networks (SDN) and block dropout (BD) in Transformers, SDN/BD cannot ensure retention of informative texture details.

Computational & Communication Overhead [R2, MR]: Since all methods in Table 1 use the same h-ViT architecture, their computational costs are similar. SViBS and FeSViBS require training an additional shared projection network, but this is mitigated by the fact that they do not require a forward/backward pass through the entire ViT body. Centralized and local training do not incur any communication cost. For the other methods, the communication cost per client per collaboration round is as follows:

    • FedAvg/FedProx: ~97M parameters (~12M for head, ~85M for body, and ~7,000 for tail)
    • SLViT/SViBS: ~197M values (~195M for smashed representations in the forward pass and ~2M for gradients in the backward pass) for the HAM10000 dataset
    • FeSTA/FeSViBS: ~197M values + ~12M parameters per client per unifying round

Thus, the proposed method has marginally higher communication overhead than SL and twice the communication burden of FL.

Comparison with methods addressing data heterogeneity [R2, MR]: We compared our method against two SOTA non-IID FL methods, SCAFFOLD (ICML 2020) and MOON (CVPR 2021). Balanced accuracies on HAM10000, BloodMNIST, and Fed-ISIC2019 are: 1) SCAFFOLD: 0.290, 0.880, and 0.330; 2) MOON: 0.570, 0.903, and 0.450. FeSViBS clearly outperforms these methods, and we will include the results in Table 1.

Choosing the first 6 blocks [R3, MR]: Block sampling aims to leverage intermediate ViT features. Since local texture details are better captured by the initial ViT blocks, sampling from blocks 1-6 only is sufficient to obtain good performance. This also reduces the computational cost and accelerates training.

Selection of hyperparameters [R2, MR]: The best hyperparameters (learning rates, batch size, unifying rounds, etc.) for all methods were chosen through a standard grid search.

cls token [R1, MR]: The cls token is a learnable token that captures a global representation of the input image; the final class prediction is obtained by passing the cls output of the last ViT block to the classifier.
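The communication accounting in the rebuttal above can be sanity-checked with simple arithmetic. The component sizes below are the approximate figures quoted in the rebuttal; how the totals decompose (e.g., that the ~97M FedAvg total is the sum of head, body, and tail) is an assumption for illustration.

```python
# Per-client communication, using the approximate component sizes quoted
# in the rebuttal (composition of the totals is assumed).
HEAD, BODY, TAIL = 12e6, 85e6, 7e3        # h-ViT head / body / tail sizes

fedavg_per_round = HEAD + BODY + TAIL     # full model exchanged
slvit_per_round = 195e6 + 2e6             # smashed reps (fwd) + gradients (bwd)
festa_extra = HEAD                        # head/tail FedAvg, per unifying round

print(f"FedAvg/FedProx: ~{fedavg_per_round / 1e6:.0f}M parameters per round")
print(f"SLViT/SViBS:    ~{slvit_per_round / 1e6:.0f}M values per round (HAM10000)")
print(f"FeSTA/FeSViBS:  ~{slvit_per_round / 1e6:.0f}M values "
      f"+ ~{festa_extra / 1e6:.0f}M parameters per unifying round")
```

This recovers the quoted ~97M, ~197M, and ~197M + ~12M figures, consistent with the claim that the split methods carry roughly twice the per-round communication of FL.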




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After carefully considering the authors’ response and the opinions of the other reviewers, I am pleased to recommend the acceptance of the manuscript.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The key contribution of this work is a new module in ViT-based split FL, which performs latent-feature augmentation and achieves better performance than the baselines. The authors addressed the review concerns regarding motivation, baselines, and overhead. All review ratings turned positive after the rebuttal. Thus, I recommend acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers agree to accept this paper after rebuttal.


