
Authors

Xian Lin, Zengqiang Yan, Xianbo Deng, Chuansheng Zheng, Li Yu

Abstract

Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet, relatively limited well-annotated medical image data makes transformers struggle to extract diverse global features, resulting in attention collapse where attention maps become similar or even identical. Comparatively, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works are dedicated to exploring the combinations of CNN and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and convolutional feed-forward network (CFFN) corresponding to tokenization, self-attention, and feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks.
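
As a rough illustration of the block structure the abstract describes (pooling, CSA, CFFN), below is a minimal PyTorch sketch. It is reconstructed from this page alone, not from the authors' repository; all module and parameter names (e.g., `CSA`, `sigma`) are assumptions, and details such as multi-head structure and the exact parameterization of the learnable Gaussian distance map are simplified.

```python
# Hypothetical sketch of a ConvFormer block (pooling -> CSA -> CFFN),
# based only on the abstract and reviews; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSA(nn.Module):
    """CNN-style self-attention on 2D feature maps (sketch)."""
    def __init__(self, channels):
        super().__init__()
        # Q/K/V from 3x3 convolutions, preserving 2D position information.
        self.q = nn.Conv2d(channels, channels, 3, padding=1)
        self.k = nn.Conv2d(channels, channels, 3, padding=1)
        self.v = nn.Conv2d(channels, channels, 3, padding=1)
        # Learnable bandwidth of the Gaussian distance map (an assumption;
        # the paper may parameterize this per pixel or per head).
        self.sigma = nn.Parameter(torch.tensor(8.0))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)  # (B, N, C), N = H*W
        k = self.k(x).flatten(2).transpose(1, 2)
        v = self.v(x).flatten(2).transpose(1, 2)
        # Cosine similarity keeps both positive and negative affinities,
        # unlike softmax(QK^T), which the rebuttal links to attention collapse.
        a = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)
        # Gaussian distance map: down-weights distant pixels, with a learnable
        # sigma acting like an adaptive convolution-kernel size.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float().to(x.device)
        d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)  # (N, N)
        m = torch.exp(-d2 / (2 * self.sigma ** 2))
        out = (a * m) @ v                                        # (B, N, C)
        return out.transpose(1, 2).reshape(b, c, h, w)

class ConvFormerBlock(nn.Module):
    def __init__(self, channels, pool=False):
        super().__init__()
        # Pooling module replaces tokenization: 2D conv + optional max-pooling.
        layers = [nn.Conv2d(channels, channels, 3, padding=1)]
        if pool:
            layers.append(nn.MaxPool2d(2))
        self.pool = nn.Sequential(*layers)
        self.csa = CSA(channels)
        # CFFN: convolutional feed-forward network for feature refinement.
        self.cffn = nn.Sequential(
            nn.Conv2d(channels, channels * 4, 1),
            nn.GELU(),
            nn.Conv2d(channels * 4, channels, 1),
        )

    def forward(self, x):
        x = self.pool(x)
        x = x + self.csa(x)
        return x + self.cffn(x)
```

Note that this sketch materializes the full N×N distance map, which is quadratic in memory and only practical on small feature maps; it is meant to show the data flow, not an efficient implementation.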

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_61

SharedIt: https://rdcu.be/dnwD9

Link to the code repository

https://github.com/xianlin7/ConvFormer

Link to the dataset(s)

https://challenge.isic-archive.com/data/

https://www.creatis.insa-lyon.fr/Challenge/acdc/


Reviews

Review #1

  • Please describe the contribution of the paper

    Authors propose ConvFormer, a plug-and-play convolution-based module for segmentation networks. This module substitutes the regular feed-forward (FF) layers in ViT-like networks with CFFN (convolutional FFN) and the self-attention (SA) module with CSA (convolutional SA). Authors demonstrate how it solves the problem of attention collapse in low-training-data scenarios. The advantage of this module is demonstrated for five transformer-based segmentation architectures on three biomedical datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The effectiveness of the proposed module is demonstrated for five different segmentation architectures on three independent datasets.
    2. In addition to performance metrics, the authors report attention matrices, demonstrating that their method indeed solves the problem of attention collapse.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. Authors do not provide standard deviations, nor run any statistical testing to assess the significance of the experimental differences.
    2. Authors do not assess the utility of their method for larger training datasets (e.g., > 5000 2D images). Since transformer-like architectures are known to perform worse than convolution-based architectures in low-data scenarios, it is unclear if the reported networks were trained until actual convergence.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors perform experiments on two public and one private dataset. They do not provide the code to reproduce the experiments, nor trained network weights.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Questions to authors:

    1. How do you perform train-test splitting for the ICH dataset, slice-wise or patient-wise?
    2. Have you considered your modifications for larger training datasets, or is the problem of attention collapse specific to low-training-data scenarios?

    Comments:

    1. It would be nice to add hyperlinks for your references by adding the hyperref package: \usepackage[colorlinks]{hyperref}
    2. It would be interesting to see an ablation of Pooling, CFFN, and CSA separately, since you only report metrics after all modifications.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Authors target a very specific problem, attention collapse, solve it, and demonstrate that it improves segmentation quality on 3 independent datasets.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors developed a CNN-style transformer, whose modules are named ConvFormer, to address the problem of attention collapse in transformers. In ConvFormer, the modules of a vision transformer are replaced with convolutional counterparts. Five SOTA transformer or transformer-CNN hybrid networks were trained with ConvFormer modules, along with three methods for addressing attention collapse, on three different datasets. The proposed modules improved the segmentation performance of all five models, the resulting self-attention matrices showed more structure, and the ablation experiments showed the importance and scale of the global context relevant for the segmentation task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is elegant and the various components are well motivated. Extensive evaluation with five different models, comparison with three different methods to address attention collapse on three datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of how ConvFormer captures long-range interactions or global context is unclear. One of the main advantages could be a reduction in model size in terms of the number of parameters, but that is not provided in the main paper or the supplementary material.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The proposed approach is an elegant approximation of a transformer model with convolutional units. It is well motivated for training on small medical datasets. It would be interesting to compare the performance of ConvFormer on relatively large training datasets and assess whether it still retains the performance advantage compared to transformer architectures (like Swin UNETR).

    The convolution matrix A is generated from Q and K, which themselves are generated through learnable projection matrices over 3x3 neighborhoods, by computing the cosine similarity. The cosine similarity is comparable to the dot product in the transformer module. A is then multiplied by M, which is a learnable Gaussian distance map. This could be interpreted as a modulation of the similarity map, and the rationale for this modulation is not provided. It is unclear whether this is comparable to the masking step in transformers and how it affects segmentation performance (a reconstruction of this computation is written out after these comments).

    The details of ConvFormer in Fig. 2 don't match the description in Sec. 3.2. Towards the end of that section, it is mentioned that A is multiplied with K, but according to Fig. 2, A is multiplied by V.

    Consider providing some estimate of model size in Table 1 (number of parameters / FLOPs).

    Please specify the dataset used for training the models (visualization of the self-attention matrices) in Fig. 3 and the alpha used. Is alpha optimized per model and dataset? How does it vary for the different models trained on the same dataset?
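
    For concreteness, the CSA computation as this review reads it can be written as follows. This is a reconstruction from the review text alone; the position symbols $p_i$, the bandwidth $\sigma$, and the exact form of $M$ are assumptions rather than the authors' definitions:

    $$
    A_{ij} = \frac{q_i^{\top} k_j}{\lVert q_i \rVert \, \lVert k_j \rVert},
    \qquad
    M_{ij} = \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma^2}\right),
    \qquad
    \tilde{A} = A \odot M,
    $$

    where $q_i$ and $k_j$ come from 3x3 convolutional projections, $p_i$ is the 2D position of pixel $i$, and $\tilde{A}$ plays the role of the adaptively sized convolution kernel applied to $V$.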

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper was fairly well organized, easy to read and the modules are well motivated. The performance of all five models improved consistently compared to the originally reported performance and against the three other approaches used for addressing attention collapse.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors' rebuttal provides an explanation of why they think the model will work well with large training datasets, but it is not convincing. My original opinion still holds.



Review #3

  • Please describe the contribution of the paper

    The authors proposed a plug-and-play module, ConvFormer, to alleviate the problem of attention collapse in conventional attention mechanism used in ViT. The authors conducted extensive experiments on three different datasets using five different SOTA methods, and they showed that their proposed module was able to outperform the conventional attention mechanism.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The module the authors proposed, ConvFormer, is easy to implement and the results presented in the paper seem to be easy to reproduce.

    2. The problem that the authors are trying to solve is interesting and relevant, since attention collapse is really common in attention-based models, especially in medical computer vision where the data is limited.

    3. The experiments are extensive, as the module is being compared on five different state-of-the-art models on three different datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Most of the method section is spent explaining the flaws of ViT-like attention while asserting the superiority of the proposed attention module. However, most transformer-based methods used in medical imaging (including the competing methods in the experiment section) include several convolutions before the transformer module. Therefore, the other competing methods also consider 2D positional information when performing attention. The fact that the proposed method operates in 2D is an advantage only over ViT, rather than over the other competing transformer-based methods.

    2. The tokenization in the original ViT and the pooling in the proposed method are not really comparable, since tokenization tells the network where each patch comes from, while pooling encodes the image into a denser representation.

    3. The use of pooling is questionable: in TransUNet, for example, the input to the module already consists of deep feature maps.

    4. The motivation of using a cosine similarity map and Gaussian learnable distance map in the CSA is not clearly stated in the manuscript.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed method was tested on public datasets and the method is easy to implement. Besides, the code will be provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. In the method section, the authors may discuss other existing transformer-based methods rather than only ViT. As mentioned in the weaknesses, the advantages of the proposed module over ViT might not hold when compared with other methods.

    2. Rewording the description of the pooling in the proposed method is recommended. The current description exaggerates its effectiveness and novelty, and the comparison with the tokenization in ViT might not be reasonable.

    3. It would be better if the motivation for using a cosine similarity map and a Gaussian learnable distance map in the CSA were explained. Besides, more experiments justifying these two choices would make the argument stronger.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the results are promising, there are some issues in the method section. The reasoning behind the design of the module is not clear. The pooling part does not seem necessary, while the use of both the cosine similarity and the learnable Gaussian map is unjustified. This raises the question of how and why the proposed method solves the problem of attention collapse.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The idea to solve attention collapse is interesting. The proposed method is also reasonable. The performance is demonstrated to be good on three datasets with several different segmentation architectures. The reported attention matrices are also helpful for validating the effectiveness of the proposed method in solving the attention collapse problem. As for weaknesses, some minor issues should be explained in the paper; please refer to R3. Generally, this is a good and interesting paper.




Author Feedback

We thank the reviewers for the valuable comments and appreciate the recognition of our method being interesting (MR, R3), reasonable (MR, R2), effective (MR, R1, R2, R3), and extensively evaluated (R2, R3). The main concerns are addressed as follows:

[R3: Pooling and tokenization are not comparable] As stated in ViT, tokenization serves to make a 2D image suitable for the 1D input of vanilla transformers. In ConvFormer, the pooling module has the same role of adapting inputs for the following self-attention calculation and feature refinement. In CNN-transformer hybrid frameworks using convolutions before transformers (e.g., TransUNet), no max-pooling is performed in the pooling module, just like setting the patch size to 1 in tokenization. In pure transformer-based frameworks, given larger feature maps, tokenization typically adopts a larger patch size to reduce the input size; in the pooling module, this is realized via multiple max-pooling operations. Following the comment, we reword the description in Sec. 3.1 as: The pooling module is developed to realize the functions of tokenization (i.e., making the input suitable for transformers in the channel dimension and shaping/reducing the input size when needed) without losing details along the grid lines of tokenization.

[R3: The motivation for using cosine similarity and Gaussian learnable distance maps in CSA] Attention collapse can be caused by high correlation and duplication among learned token representations and by ignored local relationships among tokens [15,20]. The vanilla transformer calculates self-attention values globally and constrains them to be positive, in (0, 1), via dot-product and softmax, making token representations become similar. Convolution escapes from duplicate representations with less perception redundancy and has a free range of kernel values. Thus, we use cosine similarity in CSA to measure the relevance among pixels while producing both positive and negative values in the self-attention matrices, which allows for more diversity in token representations. Meanwhile, a Gaussian learnable distance map is introduced to enhance locality and adaptively adjust receptive fields to reduce redundant dependencies, especially with distant pixels, as not all pixels need global interactions, which is consistent with scalable convolution kernels.

[R3: Advantages of ConvFormer over other transformer-based methods in addition to ViT] In our experiments, various kinds of transformer-based methods were selected for comparison, including ViT-like (i.e., SETR, TransFuse, and FAT-Net), CNN-ViT cascaded (i.e., TransUNet), and window-/patch-based (i.e., Patcher). Based on Fig. 3 and Table 1, plugging ConvFormer into each method can alleviate attention collapse and achieve better performance. Such results validate the effectiveness of ConvFormer across transformer architectures.

[R1&R2: Evaluation on larger training datasets] Attention collapse can happen even with larger training datasets when transformers are deep. As stated in Sec. 2, existing methods to address attention collapse (e.g., Re-attention, LayerScale, and Refiner) were proposed for natural images and validated on large-scale training datasets (e.g., ImageNet). In our experiments, ConvFormer stably outperforms them, as shown in Table 1. In addition, as shown in supplementary Tables 1 and 2, transformer-based methods using ConvFormer outperform existing SOTA methods. We believe ConvFormer is promising for larger training datasets as well, which will be validated in our future work.

[R2: How ConvFormer captures long-range interactions] Long-range interactions are captured by the CSA module. Specifically, CSA first computes the relevance (i.e., cosine similarity) of all pixels to each target pixel, giving each pixel a global receptive field. Then, a learnable Gaussian distance map is introduced to remove redundant perceptions. In this way, meaningful long-range interactions are built for each pixel.
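
To make the rebuttal's value-range argument concrete, below is a small toy demonstration (our illustration, not the authors' code): softmax dot-product attention yields strictly positive rows that sum to 1, whereas cosine similarity yields signed affinities in [-1, 1], which the rebuttal argues leaves more room for diverse token representations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(4, 8)   # 4 toy tokens, 8 channels
k = torch.randn(4, 8)

# Vanilla self-attention: dot product + softmax -> every entry is positive
# and each row sums to 1, pulling rows toward a common simplex.
vanilla = F.softmax(q @ k.T / 8 ** 0.5, dim=-1)
print(vanilla.min().item() > 0, vanilla.sum(-1))  # True, tensor([1., 1., 1., 1.])

# CSA-style cosine similarity -> signed values in [-1, 1], with no row
# normalization, so affinities can suppress as well as amplify.
cosine = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).T
print(cosine.min().item() < 0)                    # typically True
```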




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal about using cosine similarity and a Gaussian learnable distance map in CSA makes sense to me. The issue that pooling and tokenization are not comparable is not quite well addressed. However, I still agree with R1 and R2: this is an interesting paper targeting an important problem (attention collapse). Please do release the code.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    As mentioned in the reviews, the goal of approximating the Transformer's functionalities using CNN-based approaches is interesting, and the method is sound. I prefer to accept this paper.

    Minor: Should there be some upsampling procedures that resize the ConvFormer’s output back to the input image size? Please clarify this in the paper if it is accepted.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The issues with the paper are minor and the authors have responded to these comments. The overall recommendation is to accept and the issues from reviewer 3 have been responded to in detail.


