
Authors

Faris Almalik, Mohammad Yaqub, Karthik Nandakumar

Abstract

Vision Transformers (ViT) are competing to replace Convolutional Neural Networks (CNN) for various computer vision tasks in medical imaging such as classification and segmentation. While the vulnerability of CNNs to adversarial attacks is a well-known problem, recent works have shown that ViTs are also susceptible to such attacks and suffer significant performance degradation under attack. The vulnerability of ViTs to carefully engineered adversarial samples raises serious concerns about their safety in clinical settings. In this paper, we propose a novel self-ensembling method to enhance the robustness of ViT in the presence of adversarial attacks. The proposed Self-Ensembling Vision Transformer (SEViT) leverages the fact that feature representations learned by initial blocks of a ViT are relatively unaffected by adversarial perturbations. Learning multiple classifiers based on these intermediate feature representations and combining these predictions with that of the final ViT classifier can provide robustness against adversarial attacks. Measuring the consistency between the various predictions can also help detect adversarial samples. Experiments on two modalities (chest X-ray and fundoscopy) demonstrate the efficacy of SEViT architecture to defend against various adversarial attacks in the gray-box (attacker has full knowledge of the target model, but not the defense mechanism) setting. Code: https://github.com/faresmalik/SEViT

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_36

SharedIt: https://rdcu.be/cVRtm

Link to the code repository

https://github.com/faresmalik/SEViT

Link to the dataset(s)

https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset

https://www.kaggle.com/c/aptos2019-blindness-detection/data


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper works on improving the robustness of a ViT to adversarial attacks. The authors propose a self-ensembling technique to learn multiple classifiers based on the intermediate feature representations. Experiments are conducted on the Chest X-ray dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths:

    1. This paper is well organized and easy to follow.
    2. The idea of using ensemble learning on intermediate features is novel and seems to work well.
    3. The work deepens our understanding of the combination of adversarial learning and transformers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses:

    1. Can the authors specify the motivation or application of adversarial learning in the medical area? I cannot think of an example where such perturbations occur and adversarial learning would help.
    2. The experiments in Table 1 are unfair. During the training of the ViT, different levels of adversarial attack should be added to the training iterations to improve the ViT’s robustness to perturbations, as is done in the PGD line of papers.
    3. Other adversarial learning techniques should be compared, for example YOPO (https://proceedings.neurips.cc/paper/2019/hash/812b4ba287f5ee0bc9d43bbf5bbe87fb-Abstract.html), FOSC (https://arxiv.org/abs/2112.08304), and so on.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    can be reproduced

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    My main concern is that the comparison between ViT and SEViT is not fair. If the authors can provide a fair comparison, that would be helpful.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The performance is good and the motivation is novel.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a self-ensembling transformer for adversarial robust medical image classification. The proposed SEViT is validated on two public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and easy to understand. Extensive experiment results show the superiority of the proposed model. Different modalities are used for the experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The motivations are not clear. The proposed method seems to lack novelty. Experiment needs further comparisons with state-of-the-arts.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses. Other detailed comments are as follows:

    1. In Section 1, the authors cite [11], [20], [22], etc., as approaches that enhance the robustness of ViTs. However, none of these approaches is compared in the experiments.
    2. It seems the idea is from [22]. More explanations about the improvements would be helpful.
    3. Contribution 3 can be removed since it belongs to the experimental results.
    4. Many parameters are not well defined, such as the threshold for the KL-matrix.
    5. Some visualization results should be added to support the claims.
    6. The reasons for the noticeable performance drop in MLP number = 12 for X-ray in Fig. 3 (a) should be explained.
    7. It is recommended to compare the proposed model with other state-of-the-art methods.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factors are insufficient experiments and the lack of novelty.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper targets a good topic on the robustness of transformer-based models. By studying adversarial attacks on the transformer model, the paper evaluates the effect when the perturbation exists in the Transformer model. In general, this paper is well written and has a potential impact on natural vision problems but limited innovation and interest in the medical imaging community.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Original evaluation of the robustness of Transformer models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The introduction gives good background on adversarial attacks and the robustness of transformers; however, adversarial attacks are mostly discussed in the context of natural image processing. The motivation for conducting the same analysis in medical image analysis is not well discussed.
    2. The paper proposes a simple but effective method of handling adversarial attacks. The self-ensemble approach is clean and easy to follow. Is the method applicable to natural images, e.g., benchmarked on ImageNet? It would be better to discuss this, as the proposed method is designed for generic transformer models rather than specifically for medical image analysis.
    3. According to Table 1, the proposed method is effective when an adversarial attack exists. However, the claimed robustness comes with a large drop in clean performance. Will the community accept this trade-off, given that adversarial attacks are not commonly considered in deployment?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Based on the reproducibility checklist, the work should be well reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper demonstrates a valuable evaluation of the Transformer model. To further improve the impact on the MICCAI community, the authors could discuss the motivation, challenges, and potential applications of how adversarial attacks influence medical image analysis.

    Detailed constructive comments are listed along with the above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Well-written context and new aspect of the evaluation study.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper receives 2 accepts and 1 weak reject. The paper discusses the robustness of Transformer-based models in the medical area, and proposes a self-ensembling Transformer for building robust medical representation models. All reviewers agree that the paper is well-written and the methodology is clear and well-supported by extensive experimental results. Given that many existing works discuss building high-performance medical models using Vision Transformers, this study demonstrates yet another appealing aspect - the robustness of the Transformer model - which could be potentially impactful in the MICCAI community. Based on the reviews and the paper, the meta-reviewer recommends acceptance, and hopes that the authors can better clarify the motivation and application of studying adversarial robustness in the medical area, to help the community better understand the practicality of the method.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers (R1, R3 and R4) and the meta reviewer (MR) for the valuable comments and feedback.

MR.3, R1.3.1, R4.3.1: The motivation for studying the adversarial robustness of deep learning-based medical imaging systems has been discussed in detail by Finlayson et al. in [8], who have identified two links in the healthcare economy that are susceptible to adversarial attacks: (i) automated systems deployed by insurance companies to process reimbursement claims – adversarial samples can fool such systems and trigger specific diagnostic codes to obtain higher payouts, and (ii) automated systems deployed by regulators to confirm results of clinical trials – malicious manufacturers can employ adversarial test samples to successfully pass clinical trials and such attacks may go unnoticed even if there is a human-in-the-loop. In addition, the rapid growth of telemedicine (e.g., teleradiology) during the COVID-19 pandemic and the emergence of “as-a-service” business models based on cloud computing (e.g., radiology-as-a-service) have created an environment where medical images will be increasingly processed remotely – often, automated machine learning algorithms will perform the diagnosis, which may be optionally verified by human experts. Such medical imaging scenarios will be highly vulnerable to adversarial attacks. Moreover, the lack of unambiguous ground-truth, high standardization of medical images, and use of commodity deep neural network architectures (e.g., Vision Transformer) are likely to further exacerbate the vulnerability of medical imaging systems to adversarial attacks [8]. Thus, a robust defensive strategy must be devised before automated medical imaging systems can be securely deployed.

R3.6.[1,2]: [11] and [20] proposed defense mechanisms against adversarial patch attacks on ViT, which is different from adversarial perturbation attacks considered in this work. Hence, direct comparison with [11,20] is not possible. [22] introduced the concept of self-ensembling to generate more transferable attacks against ViT. This is achieved by distilling knowledge from intermediate features of ViT blocks using a shared classifier. In contrast, our approach (SEViT) uses the self-ensembling concept to defend ViTs against adversarial attacks. We construct independent classifiers based on intermediate features of ViT blocks and utilize the consistency between the predictions of these classifiers to make the SEViT more robust against adversarial attacks.
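The fusion idea described above can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the logit values, head count, and probability-averaging fusion rule are assumptions (SEViT's actual heads are MLP classifiers trained on intermediate ViT block features).

```python
import math

def softmax(logits):
    """Convert raw scores to a probability distribution."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_ensemble_predict(head_logits):
    """Fuse per-head logit vectors (intermediate classifiers plus the
    final ViT head) by averaging their class probabilities."""
    probs = [softmax(l) for l in head_logits]
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c]), avg

# Three hypothetical heads voting on a binary task (e.g., normal vs. TB);
# two confident heads outvote one uncertain head.
pred, avg = self_ensemble_predict([[2.0, 0.5], [1.5, 0.2], [0.1, 0.3]])
```

Averaging probabilities (rather than hard majority voting) keeps the ensemble's confidence available for the detection step discussed below.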

R1.3.[2,3]: SEViT is an adversarial defense method that is completely orthogonal to the adversarial training strategy adopted in YOPO and FOSC algorithms. In adversarial training, adversarial samples generated using PGD attacks are included as part of the training set to make the resulting classifier more robust. The limitation is that such adversarial trained classifiers have low generalizability against unseen attack types. On the other hand, the SEViT method is more generalizable because it does not make any assumptions about the attack type and does not involve any adversarial samples during training. In our experiments, both ViT and SEViT are not subjected to adversarial training, and this makes the comparison fair. Comparing SEViT against adversarial training algorithms and applying adversarial training on top of the SEViT defense are possible future extensions.
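For context on the adversarial-training strategy mentioned above (which SEViT deliberately avoids), here is a minimal single-step FGSM sketch on a toy logistic model with an analytic gradient; the weights and perturbation budget are made up for illustration, and PGD simply iterates this step with projection back onto the epsilon-ball.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_example(x, y, w, b, eps):
    """One-step FGSM: perturb input x in the direction that increases
    the logistic loss, bounded by an L-infinity budget eps.
    For logistic loss, d(loss)/dx_i = (p - y) * w_i."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    # step by the sign of the gradient, scaled by the budget
    return [xi + eps * (1 if g > 0 else -1 if g < 0 else 0)
            for xi, g in zip(x, grad)]

# Adversarial training would mix such perturbed samples into each
# training batch; SEViT instead trains only on clean samples.
x_adv = fgsm_example([1.0, -0.5], 1, [0.8, -0.3], 0.0, 0.1)
```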

R3.6.4: Instead of reporting a single threshold, we have presented the ROC curves in Fig 4 and reported the AUC for different attacks. If required, the threshold value can be chosen based on the desired True Positive and False Positive Rates.
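As a hedged sketch of how such a consistency score could be thresholded for detection: the mean pairwise KL score below is illustrative and may differ from the paper's exact KL-matrix construction; the threshold value 0.5 is arbitrary and would in practice be read off the ROC curve.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def consistency_score(head_probs):
    """Mean pairwise KL divergence between the heads' predictive
    distributions; larger values mean the heads disagree more."""
    pairs = [(i, j) for i in range(len(head_probs))
             for j in range(len(head_probs)) if i != j]
    return sum(kl_divergence(head_probs[i], head_probs[j])
               for i, j in pairs) / len(pairs)

def detect(head_probs, threshold):
    """Flag the sample as adversarial if the heads disagree too much;
    the threshold is chosen from the ROC at the desired TPR/FPR."""
    return consistency_score(head_probs) > threshold

clean = [[0.9, 0.1], [0.85, 0.15], [0.88, 0.12]]  # heads agree
adv = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]        # heads disagree
```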

R4.3.[2,3]: We have evaluated SEViT for the medical image classification task and the promising results serve as a motivation to perform further experiments on natural images. Currently, we are working on showing the efficacy of SEViT in the context of natural images. We are also working on minimizing the drop in clean accuracy in the absence of adversarial attacks.


