
Authors

Ailiang Lin, Jiayu Xu, Jinxing Li, Guangming Lu

Abstract

Over the past few years, convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant architectures in medical image segmentation. Although CNNs can efficiently capture local representations, they have difficulty establishing long-distance dependencies. In contrast, ViTs achieve impressive success owing to their powerful global context modeling capabilities, but they may not generalize well on small datasets due to the lack of the inductive biases inherent to CNNs. To inherit the merits of these two design paradigms while avoiding their respective limitations, we propose a concurrent structure termed ConTrans, which couples detailed localization information with global contexts to the maximum extent. ConTrans consists of two parallel encoders, i.e., a Swin Transformer encoder and a CNN encoder. Specifically, the CNN encoder is progressively stacked from the novel Depthwise Attention Block (DAB), which aims to provide the precise local features we need. Furthermore, a well-designed Spatial-Reduction-Cross-Attention (SRCA) module is embedded in the decoder to form a comprehensive fusion of these two distinct feature representations and eliminate the semantic divergence between them. This allows the model to obtain accurate semantic information and ensures semantic consistency of the up-sampled features in a hierarchical manner. Extensive experiments across four typical tasks show that ConTrans significantly outperforms state-of-the-art methods on ten well-known benchmarks.
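The DAB described in the abstract (depthwise convolution followed by channel and spatial attention, in the spirit of CBAM) can be illustrated with a minimal NumPy sketch. This is a toy with fixed, unlearned weights and assumed gating choices, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dab_sketch(x, dw_kernel):
    """Toy Depthwise Attention Block: depthwise conv, then channel
    attention, then spatial attention (a CBAM-style sequence).

    x:         (H, W, C) feature map.
    dw_kernel: (k, k, C) one filter per channel (depthwise).
    All weights here are placeholders; in DAB they would be learned.
    """
    H, W, C = x.shape
    k = dw_kernel.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    # Depthwise convolution: each channel is convolved with its own k x k filter.
    y = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            y[i, j] = np.einsum('abc,abc->c', xp[i:i + k, j:j + k], dw_kernel)
    # Channel attention: gate each channel by its global average response.
    y = y * sigmoid(y.mean(axis=(0, 1)))
    # Spatial attention: gate each position by its cross-channel mean.
    y = y * sigmoid(y.mean(axis=2, keepdims=True))
    return y
```

Because the depthwise convolution uses one filter per channel, its cost is roughly 1/C of a standard convolution, which is the basis of the paper's "lightweight" claim for DAB.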

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_29

SharedIt: https://rdcu.be/cVRyH

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces the Swin Transformer to medical image segmentation and develops a hybrid network leveraging the CNN's local information extraction ability and the Transformer's long-range dependencies. This model outperforms previous CNN- and Transformer-based SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-written, understandable, and obtains impressive results.
    2. The paper is insightful. It analyzes the inherent shortcomings of CNN and Transformer and comes up with a well-designed hybrid network to inherit their merits.
    3. DAB (Depthwise Attention Block) looks promising. This could not only be applied to medical image segmentation but also other CV-related tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors mention the “inductive biases” of CNNs several times but do not elaborate on what these biases are. Section 2.2 states that “they may not perform well on medical datasets due to the lack of inductive biases inherent to CNNs” and then proposes the “Depthwise Attention Block (DAB)”. This DAB module provides depthwise convolution, channel attention, and spatial attention, which vanilla CNNs do not. The DAB is thus an enhancement of CNNs. It may be necessary to revise the wording here to make clear what vanilla CNNs bring to the Transformer and what DAB brings to CNNs.
    2. The evaluation metrics are insufficient to validate the performance. Specifically, the authors should include experiments of the critical metrics, i.e., E-measure, Fbw, and S-measure. References to these metrics can be found in the following papers:

    [1] Liu et al., Visual Saliency Transformer. ICCV 2021. (E-measure)
    [2] Wei et al., F3Net: Fusion, Feedback and Focus for Salient Object Detection. AAAI 2020. (E-measure)
    [3] Su et al., Selectivity or Invariance: Boundary-aware Salient Object Detection. ICCV 2019. (Fbw)
    [4] Zhao et al., Pyramid Feature Attention Network for Saliency Detection. CVPR 2019. (Fbw)
    [5] Fu et al., DeepSide: A General Deep Framework for Salient Object Detection. Neurocomputing 2019. (E-measure, Fbw, S-measure)

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is possible to reproduce the paper. The paper presents a straightforward and well-explained model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The notations and text in Fig. 1 are too small to read in print. Please consider enlarging them.
    2. There is an extra “Q” in the first “LN” of the SRCA module in Fig. 1, probably a typo. The authors may remove it.
    3. Table 1 would be improved by indicating the category each method falls into, e.g., by adding a column showing whether a method is Transformer-based or CNN-based.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Refer to weakness and strength section.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This manuscript proposes a hybrid architecture termed ConTrans for medical segmentation. It not only exploits CNN’s capacity in capturing detailed local information, but also leverages Transformer to build long-range dependencies and global contexts. Extensive experiments on four typical medical segmentation tasks across ten public datasets demonstrate its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    ++ The method achieves promising results on four common medical segmentation tasks: polyp segmentation, skin lesion segmentation, pneumonia lesion segmentation, and cell segmentation. ++ The paper has good clarity and organization.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – The authors claim that the Depthwise Attention Block (DAB) is “proposed”. However, this kind of module is commonly used in computer vision and should not be counted as one of the manuscript's three major contributions. – Though promising results have been achieved, the comparison may not be entirely fair. The authors should provide an efficiency comparison in the experimental section, for example, inference speed and model parameters. – The ablation study is insufficient. How does the bare Swin Transformer backbone perform? Furthermore, I am interested in whether the improvement simply comes from the Swin Transformer.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I have checked the ‘Reproducibility Response’ from the authors, and I also think it would not be difficult to reproduce this model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see section 4 and section 5.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this manuscript has some merits and interesting ideas. However, the introduced attention-based module is not novel enough to serve as a major contribution. I cannot recommend accepting this manuscript with a top-ranking score.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a model that combines features from a Swin Transformer and a well-designed CNN to perform medical image segmentation. A feature integration module, the spatial-reduction-cross-attention (SRCA) module, is also proposed to effectively fuse the two styles of features. Extensive experiments are performed on various datasets to show the effectiveness of the proposed methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors explore and exploit the most advanced progress in vision transformers (Swin Transformer, cross-attention) for the medical segmentation task;

    • The method is evaluated on multiple datasets and achieves state-of-the-art performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed DAB module is basically the same as CBAM ([28] as cited in the paper), with the only change being the depthwise conv; however, the reason why depthwise conv would reduce local redundancy compared to standard conv is not justified;
    • While the proposed method achieves SOTA on various datasets in terms of segmentation metrics, the model size and FLOPs are not shown. It is hard to know whether the improvement is due to a larger model (as two parallel encoders are used here) or to the proposed design;
    • The evaluation datasets are mostly 2D segmentation; it would be more interesting to know the performance on, e.g., 3D datasets such as MRI and CT;
    • As the Swin Transformer has already introduced locality, the motivation for using two parallel encoders, one CNN and one Swin Transformer, is not well justified;
    • The design motivation of the SRCA module is not justified, e.g., why is the transformer feature used as the query and the CNN feature as the key and value?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficient implementation details are provided in the manuscript.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Sec 1, P1: “To alleviate such issues, considerable efforts are devoted to enlarging the receptive fields by introducing effective sampling strategies [7,14], spatial pyramid enhancement [31] or various attention mechanism [28]”: how do the spatial pyramid and the attention mechanism in [28] enlarge receptive fields? Also, it would be better to cite works in the medical domain if possible;
    • In the SRCA module, transformer features are only used as the “query”; that is to say, the transformer features and CNN features are not truly fused. The SRCA module essentially filters the CNN features and is equivalent to a spatial attention module over the CNN features. However, the paper claims that the SRCA module fuses the two styles of features, which is not precise; please consider rephrasing the presentation;
    • In the ablation study, how do you ablate DAB and SRCA? In other words, what is the baseline version of your model? A more detailed analysis of the ablation study is needed;
    • As mentioned in the weaknesses, I suggest including the model size and FLOPs in the comparison with other methods, as using two parallel encoders may significantly increase the model parameters; also, the design motivation of the parallel encoders, the DAB module, and the SRCA module should be strengthened;
    • Evaluate the method on a 3D medical dataset and compare with, e.g., nnUNet3D.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the proposed model achieves SOTA on various datasets, whether the improvement is due to the proposed design or simply to using a larger encoder is unclear; on the other hand, most of the contribution is derivative, based on prior work with few modifications and weak motivation.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces the Swin Transformer to medical image segmentation and develops a hybrid network leveraging the CNN's local information extraction ability and the Transformer's long-range dependencies. This model outperforms previous CNN- and Transformer-based SOTA methods.

    This paper is well-written, understandable, and obtains impressive results.

    The proposed DAB module is basically the same as CBAM ([28] as cited in the paper), with the only change being the depthwise conv; however, the reason why depthwise conv would reduce local redundancy compared to standard conv is not justified.

    While the proposed method achieves SOTA on various datasets in terms of segmentation metrics, the model size and FLOPs are not shown.

    As the Swin Transformer has already introduced locality, the motivation for using two parallel encoders, one CNN and one Swin Transformer, could be better justified.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

1. They did not elaborate on what these biases are. (R1): The ViT authors argued that “Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.”

  2. The DAB is an enhancement of CNNs. (Meta-Reviewer, R1, R2, R3): The proposed DAB is a lightweight module comprising depthwise convolution, channel attention, and spatial attention. It aims to provide the precise local features that the Transformer branch is missing and needs, thereby reducing the local redundancy in the CNN branch.
  3. The evaluation metrics are insufficient to validate the performance. (R1): In fact, ConTrans outperforms state-of-the-art methods in terms of DSC, mean IoU, Recall, and Precision. Due to limited space, we follow MCTrans [16] and only use DSC as the evaluation metric.
  4. The authors should provide an efficiency comparison in the experimental section. (Meta-Reviewer, R2, R3): A lightweight design and a comparison of inference speed will be included in our future work.
  5. How does the bare Swin Transformer backbone perform? (R2): Please refer to the first row of Table 3.
  6. Evaluate the method on a 3D medical dataset. (R3): Due to limited space, this work focuses on 2D medical segmentation. Note that we evaluate ConTrans on 10 different 2D datasets and obtain significant performance improvements.
  7. The design motivation of the SRCA module is not justified. (Meta-Reviewer, R3): CNN and Transformer are generally considered two distinct techniques for representation learning. To inherit the merits of these two design paradigms while avoiding their respective limitations, we develop the novel SRCA module, which aims to eliminate the semantic divergence between local features and global representations through the cross-attention mechanism.
  8. Why is the transformer feature used as the query and the CNN feature as the key and value? (R3): We follow the typical design of the Transformer decoder, and the TCA module in MCTrans.
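The SRCA arrangement discussed above — Transformer features as queries attending over spatially reduced CNN features as keys and values — can be sketched in NumPy. This is an illustrative toy under assumed simplifications (single head, no learned projections, average pooling for the spatial reduction), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def srca_sketch(trans_tokens, cnn_map, r):
    """Toy spatial-reduction cross-attention.

    trans_tokens: (N, d) Transformer features, used as queries.
    cnn_map:      (H, W, d) CNN features, used as keys/values after
                  average-pooling by factor r (the "spatial reduction",
                  which shrinks the key/value sequence by r**2).
    """
    H, W, d = cnn_map.shape
    # Spatial reduction: r x r average pooling over the CNN feature map.
    pooled = cnn_map.reshape(H // r, r, W // r, r, d).mean(axis=(1, 3))
    kv = pooled.reshape(-1, d)                          # (H*W/r^2, d)
    # Scaled dot-product attention: each Transformer token gathers
    # the CNN features most relevant to it.
    attn = softmax(trans_tokens @ kv.T / np.sqrt(d))    # (N, H*W/r^2)
    return attn @ kv                                    # (N, d)
```

The output rows are convex combinations of the (pooled) CNN features, which matches the reviewer's observation that SRCA effectively filters the CNN features under guidance from the Transformer queries.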


