
Authors

Nabil Ibtehaz, Daisuke Kihara

Abstract

This decade is marked by the introduction of Vision Transformer, a radical paradigm shift in broad computer vision. A similar trend is followed in medical imaging: UNet, one of the most influential architectures, has been redesigned with transformers. Recently, the efficacy of convolutional models in vision has been reinvestigated by seminal works such as ConvNext, which elevates a ResNet to Swin Transformer level. Deriving inspiration from this, we aim to improve a purely convolutional UNet model so that it can be on par with the transformer-based models, e.g., Swin-Unet or UCTransNet. We examined several advantages of the transformer-based UNet models, primarily long-range dependencies and cross-level skip connections. We attempted to emulate them through convolution operations and thus propose ACC-UNet, a completely convolutional UNet model that brings the best of both worlds: the inherent inductive biases of convnets with the design decisions of transformers. ACC-UNet was evaluated on 5 different medical image segmentation benchmarks and consistently outperformed convnets, transformers, and their hybrids. Notably, ACC-UNet outperforms state-of-the-art models Swin-Unet and UCTransNet by $2.64 \pm 2.54\%$ and $0.45 \pm 1.61\%$ in terms of dice score, respectively, while using a fraction of their parameters ($59.26\%$ and $24.24\%$). Our codes are available at https://github.com/kiharalab/ACC-UNet.
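
The abstract reports its gains in terms of dice score. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), for binary masks. The function name `dice_score`, the flat-list mask representation, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; they are not taken from the paper or its repository.

```python
def dice_score(pred, target):
    """Dice coefficient 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if denominator == 0:
        return 1.0
    return 2.0 * intersection / denominator


# Hypothetical toy masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 4))
```

A score of 1.0 means perfect overlap and 0.0 means no overlap, so the reported margins are differences on this 0 to 1 scale expressed in percent.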

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_66

SharedIt: https://rdcu.be/dnwB0

Link to the code repository

https://github.com/kiharalab/ACC-UNet

Link to the dataset(s)

https://challenge.isic-archive.com/data/#2018

https://polyp.grand-challenge.org/CVCClinicDB/

https://medicalsegmentation.com/covid19/

https://scholar.cu.edu.eg/?q=afahmy/pages/dataset

https://warwick.ac.uk/fac/cross_fac/tia/data/glascontest/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposed a convolutional version of long-range dependencies and cross-level skip connections to improve the performance of UNet.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • This paper provides an insightful modification of UNet by borrowing the ideas of long-range dependencies and cross-level skip connections.
    • The paper is validated on five public datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The new design is arbitrary. The number of hierarchical levels and the parameters such as inv_fcr, inv_fct, k, etc, are arbitrarily determined.
    • Data augmentation. The ultrasound dataset, BUSI, should not be rotated or flipped upside down. In ultrasound images, the signal attenuates from top to bottom. Simply rotating the image goes against the principle of image reconstruction.
    • Ablation study. While both modules are proposed, the impact of each module is not individually studied.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are based on public datasets. Code is also provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • As discussed in the weakness section, an ablation study can be very helpful to further justify the performance.
    • A more sophisticated data augmentation algorithm could help to boost the performance.
    • While the title and structure of the paper follow Ref [16], the contribution is not as significant as what is claimed in Ref [16].
    • The grammar should be thoroughly checked, e.g., “an UNet” should be “a UNet”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factors driving the score are the improvements to convolutional networks and the large number of datasets used for validation.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors proposed a completely convolutional UNet model (ACC-UNet), inspired by transformers, in an attempt to create a UNet that is competitive with transformer-based models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors made a thorough review of recent trends in UNet models for medical image segmentation. The method is new and well described. The evaluation was performed on 5 different medical image datasets. A good comparison with SOTA methods was presented.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors didn’t compare the training time of different models. The authors didn’t discuss the limitations of the proposed ACC-UNet.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The 5 datasets used in this paper provide a solid basis for the evaluation. However, lack of access to the datasets will limit the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Could you explain the difference between the definitions of x_2 in equation (2) and equation (6) (given i = 2)?
    2. Please add a comparison of the training time of the models.
    3. Could you provide access to the different datasets?
    4. Please add a discussion of the limitations of the proposed method.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well organized, the method is described clearly and easy to follow. The evaluation is solid using 5 different datasets to prove the potential of the proposed ACC-UNet in medical image segmentation.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper
    • This paper studies how to bring the benefits of Transformer-based architectures (e.g. large context information) onto Convolutional models by purely relying on convolutional operations.
    • The proposed architecture named ACC-UNet is intended for semantic segmentation tasks and has been evaluated on different types of medical datasets, including: polyp, skin lesions, breast ultrasounds, cancerous cells, and pneumonia.
    • Experimental results against Convolutional- and Transformer-based encoder-decoder (UNet) architectures showed the superiority of the proposed architecture both quantitatively and qualitatively.
    • The baselines have also been evaluated in terms of their number of parameters, showing that the proposed architecture requires fewer parameters than recent Transformer-based architectures, while improving their segmentation results.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed architecture based only on convolutional layers represents a new SOTA on different semantic segmentation tasks, when compared to general convolutional- and transformer-based computer vision architectures.
    • By conducting experiments on different types of image datasets, the authors showed that the model is able to generalise well across different domains.
    • Despite the additional layers, the number of model parameters in the new approach remains similar to traditional encoder-decoder convolutional architectures, while being lower than that of Transformer-based architectures.
    • In general, the architecture represents an effort towards reviving the interests and potentials of convolutional neural networks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Although the proposed approach outperforms related encoder-decoder (U-Net) based architectures, it would have been interesting to see how the architecture performs against more specialised medical image segmentation approaches, e.g. PraNet, PolypPVT, Polyp2Seg, ACSNet to cite some, or even similar approaches which also try to improve CNNs for segmentation tasks over Transformers: SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Given that the authors have anonymously published their source code together with the training/evaluation pipelines, there are no issues related to the paper’s reproducibility. Readers can therefore follow in detail how the architecture works.
    • Additionally, all the details of the architecture, training procedures, experiments have been clearly described in the article.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The study is solid and the experiments have been carefully designed. However, there is still room for improvement:
    • In Table #1, think about additionally reporting the FLOPS for each architecture.
    • For completeness, what’s the training time/inference time of the different approaches?
    • What are the current limitations and potential follow-up ideas?
    • How does the model perform against more specialised encoder-decoder architectures for medical image segmentation, e.g. PraNet, PolypPVT, Polyp2Seg, MSNet? The superiority of the approach against U-Net architectures is clear, but what about against more specialised semantic segmentation architectures?
    • To validate the hypotheses of the approach and to get more understanding on the new modules, it would have been interesting to visualise some of the learned filters.
    • The qualitative (segmentation) results are based on a single image for each modality; therefore, it would have been nice to add more segmentation results, e.g. in the appendix.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The paper proposes a novel architecture that clearly outperforms related baselines on different medical segmentation datasets. The experiments are solid, and understanding how to improve existing architectures, such as CNNs, represents an important step towards more sophisticated future models.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed ACC-UNet, which uses convolution operators to model long-range dependencies and cross-level skip connections. The proposed method achieves on-par (or even better) performance compared with ViT (Swin) based segmentation methods on several medical image segmentation datasets. The experiments are indeed impressive. However, there are still some issues in this paper. The paper lacks an ablation study, without which the design seems arbitrary. Also, we do not know the computational complexity (FLOPs and time), which is usually more important than the number of parameters. Some points are not clear; for example, why does SMESwin-Unet perform so poorly in your experiments in Table 1? Is it because of too many parameters?




Author Feedback

We sincerely thank the reviewers for the valuable feedback. Here we have tried to address the comments:

  1. Ablation study (Meta,R1): We actually presented our ablation study in the supplementary text, where we tried to follow the presentation style of the ConvNext paper. As our study is more focused towards our development roadmap, some interesting scenarios for ablation study are absent in the current version. They are: ACC-UNet without MLFC (91.9% dice, 14.M params), without HANC (90.96% dice, 16M params, 25% more filters were added) and with one more MLFC layer (92.4% dice score, 17.3M params). Regarding k and inv_fctr, we considered them as large as possible, while keeping the #params close to UNet. We will move this to the main text.

  2. Computational complexity in terms of FLOPs (Meta,R4): Our model is comparable to conv UNets (FLOPs between UCTransNet and UNet). However, UNets with a transformer backbone have fewer FLOPs due to the patch embedding at the start, which downsamples the input by 4x4=16, whereas CNNs process the pixels. Although this makes computing self-attention feasible, it sometimes results in pixelated segmentations around the edges. The FLOPs values are: UNet (37G), MultiResUNet (1.1G), Swin-Unet (6.2G), SMESwin-Unet (6.4G), UCTransNet (38.8G), ACC-UNet (38G). The FLOPs are computed using the fvcore library from Facebook Research; we have included these in Table 1.

  3. Training and Inference time (Meta, R3, R4): Most of our focus was put on bringing attributes from transformers to CNNs. In achieving this, we relied on a straightforward implementation, and so did not optimize our architecture thoroughly. As a result, our model is a bit slow. This slowdown primarily comes from the concat operation, which is extremely slow in the current PyTorch implementation. However, we can avoid most of these operations by using a pre-allocated tensor and populating it with the computed feature maps. This reduces the training time by 28.43% and makes it comparable to UCTransNet. Further optimization is possible to reduce the computational time, which we will focus on in our immediate future work. We have added this to the discussion.

Model                  Training (sec/step)   Inference (sec/step)
UNet                   0.46                  0.39
MultiResUNet           0.75                  0.53
Swin-Unet              0.55                  0.39
UCTransNet             1.33                  0.7
SMESwin-Unet           2.6                   1.89
ACC-UNet               2.11                  0.61
ACC-UNet (no concat)   1.51                  0.53
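
The speed-up quoted in the training-time response can be checked against the timing table with a few lines of arithmetic. The sketch below only uses the per-step timings listed above; the helper name `relative_reduction` is hypothetical. Because the table entries are rounded to two decimals, the recomputed figure agrees with the quoted 28.43% only up to rounding.

```python
# Per-step timings (seconds) copied from the rebuttal table.
training = {
    "UCTransNet": 1.33,
    "ACC-UNet": 2.11,
    "ACC-UNet (no concat)": 1.51,
}
inference = {
    "UCTransNet": 0.7,
    "ACC-UNet": 0.61,
    "ACC-UNet (no concat)": 0.53,
}


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


train_cut = relative_reduction(training["ACC-UNet"], training["ACC-UNet (no concat)"])
infer_cut = relative_reduction(inference["ACC-UNet"], inference["ACC-UNet (no concat)"])
# How the optimized model compares with UCTransNet per training step.
ratio_vs_uctransnet = training["ACC-UNet (no concat)"] / training["UCTransNet"]

print(f"training time reduction:  {train_cut:.2f}%")
print(f"inference time reduction: {infer_cut:.2f}%")
print(f"training step vs UCTransNet: {ratio_vs_uctransnet:.2f}x")
```

On these rounded numbers the optimized model still takes somewhat longer per training step than UCTransNet, while its inference step is faster.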
  4. Poor performance of SMESwinUNet (Meta): We agree that the probable cause behind the poor performance of SMESwinUNet is too many parameters. Unfortunately, the model, being quite new, has not been sufficiently validated on diverse medical imaging datasets. We could not find any other UNet model with a transformer backbone and skip-connection.

  5. Limitation and follow-up ideas (Meta,R3,R4): As mentioned in point 3, one limitation of our model is training time, due to the large amount of gradients being accumulated from the concat operations. Our current objective is to make the model computationally more efficient. Along with avoiding concat, we wish to perform additional optimizations, e.g. using gate signals and performing modulations, keeping tensors smaller and thus lighter for the GPU. In addition, we would like to explore other ideas originating from transformers, e.g., layer norm and the GELU function.

  6. Other comments: Linguistic issue (R1): We have thoroughly revised the manuscript for grammar and clarity. Advanced data augmentation (R1): Since we were more focused on a task-agnostic, architectural comparison, we adopted only basic augmentation strategies. Clarification regarding equations (R3): We apologize for this confusion. In eqn 6, x2 refers to the feature map from level 2 of the encoder, whereas in eqn 2 it refers to the aggregated feature map in HANC. Comparison with a broader set of models (R4): In our follow-up work, we will evaluate ACC-UNet on particular tasks and compare performance with task-specific models. Additional qualitative results and interpretation (R4): We have added more results in the appendix.
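
The training-time response mentions replacing concat operations with a pre-allocated tensor that is populated with the computed feature maps. The rebuttal does not show how this is done in ACC-UNet, so the following is only a sketch of the general pattern in PyTorch under stated assumptions: feature maps are concatenated along the channel dimension, the baseline grows the tensor with one `torch.cat` per new block, and the function names `grow_by_concat` and `fill_preallocated` as well as the tensor shapes are hypothetical.

```python
import torch


def grow_by_concat(blocks):
    """Baseline pattern: each torch.cat allocates a new tensor and copies everything so far."""
    out = blocks[0]
    for block in blocks[1:]:
        out = torch.cat([out, block], dim=1)
    return out


def fill_preallocated(blocks):
    """Alternative pattern: allocate the final tensor once, then copy each block into its slice."""
    n, _, h, w = blocks[0].shape
    total_channels = sum(b.shape[1] for b in blocks)
    out = torch.empty(
        (n, total_channels, h, w),
        dtype=blocks[0].dtype,
        device=blocks[0].device,
    )
    offset = 0
    for block in blocks:
        channels = block.shape[1]
        out[:, offset:offset + channels] = block  # slice assignment copies this block once
        offset += channels
    return out


torch.manual_seed(0)
# Hypothetical feature maps: batch of 2, 8x8 spatial size, 4/8/16 channels.
feature_maps = [torch.randn(2, c, 8, 8) for c in (4, 8, 16)]
via_concat = grow_by_concat(feature_maps)
via_buffer = fill_preallocated(feature_maps)
print(via_concat.shape, torch.equal(via_concat, via_buffer))
```

Both functions produce the same tensor; the second avoids re-copying earlier feature maps every time a new one is appended. Whether this matches the authors' actual optimization, and how much time it saves, depends on their implementation.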




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed part of my concerns, for example, computational complexity and inference time. However, we still do not understand the reason behind the poor performance of SMESwinUNet. Also, for the training time, we need to know how long the network must be trained to converge, rather than the time cost of each step. The reviewers all agree to accept this paper and I also vote “yes”.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The novel architecture presented in this paper, ACC-UNet, is a completely convolutional UNet model that brings together the best of the CNN and transformer worlds (i.e., modeling long-range cues). Its efficiency is well illustrated on several datasets. The missing computational complexity analysis and ablation study were the main blocking points, as rightfully raised by the reviewers. The authors perfectly addressed both issues, hence I recommend acceptance of this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All three reviewers are in favor of accepting this paper. The authors' rebuttal provides an ablation study, computational complexity, timing results, and so on. I think this work is ready to be accepted.


