
Authors

Jeya Maria Jose Valanarasu, Vishal M Patel

Abstract

UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years. However, these networks cannot be effectively adopted for rapid image segmentation in point-of-care applications because they are parameter-heavy, computationally complex, and slow to use. To this end, we propose UNeXt, a convolutional multilayer perceptron (MLP) based network for image segmentation. We design UNeXt in an effective way, with an early convolutional stage and an MLP stage in the latent space. We propose a tokenized MLP block in which we efficiently tokenize and project the convolutional features and use MLPs to model the representation. To further boost performance, we propose shifting the channels of the inputs while feeding them into the MLPs so as to focus on learning local dependencies. Using tokenized MLPs in the latent space reduces the number of parameters and the computational complexity while yielding a better representation to help segmentation. The network also consists of skip connections between various levels of the encoder and decoder. We test UNeXt on multiple medical image segmentation datasets and show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance than state-of-the-art medical image segmentation architectures. Code is available at https://github.com/jeya-maria-jose/UNeXt-pytorch
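For orientation, here is a high-level, hypothetical PyTorch sketch of the shape the abstract describes: an early convolutional stage, a tokenized-MLP-style stage at the latent resolution, and a decoder. Widths, depths, and the residual stand-in for the tokenized MLP block are illustrative only, and the multi-level skip connections are omitted for brevity; see the authors' repository for the real implementation.

```python
import torch
import torch.nn as nn

class UNeXtShape(nn.Module):
    """Hypothetical sketch of the design described in the abstract:
    an early convolutional stage, a tokenized-MLP-style stage at the
    latent (lowest) resolution, and a decoder. Widths and depths are
    illustrative, not the paper's."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # Stand-in for a tokenized MLP block: a depth-wise conv for
        # spatial context plus a 1x1 "MLP" that mixes channels.
        self.latent = nn.Sequential(
            nn.Conv2d(32, 32, 3, padding=1, groups=32),
            nn.Conv2d(32, 32, 1), nn.GELU())
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 1))

    def forward(self, x):
        z = self.enc(x)
        z = z + self.latent(z)   # residual latent stage
        return self.dec(z)       # segmentation logits

print(UNeXtShape()(torch.randn(1, 3, 256, 256)).shape)  # (1, 1, 256, 256)
```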

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_3

SharedIt: https://rdcu.be/cVRx9

Link to the code repository

https://github.com/jeya-maria-jose/UNeXt-pytorch

Link to the dataset(s)

https://challenge.isic-archive.com/data/

https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes a UNet-like architecture that is very efficient in computational complexity (and inference time) without sacrificing performance on segmentation tasks. This is achieved by (1) reducing the number of convolution filters used, (2) presumably using a summing long skip connection (encoder to decoder) instead of concatenation, and (3) replacing the convolutions at the lowest two resolutions with depth-wise convolutions where mixing across channels is achieved by an MLP. Horizontal and vertical shifting of the feature maps is also proposed in these depth-conv + MLP blocks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • UNext is computationally efficient and performs well for segmentation.
    • Shifting of feature maps appears novel but is poorly explained and, more importantly, not well motivated or validated. It remains unclear what is done, why it is done, and whether it is really a good replacement for positional encoding. It may be a regularizer. Or it may not do much at all.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The method is not sufficiently or clearly described. I had to search for UNext code online to understand it.
    • There is no comparison to, or even mention of, other works that reduce the computational complexity of UNet-like models.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    While the code is not included or linked in the paper, it is easy enough to find. Unfortunately, it was necessary to look at the code to understand the paper, since the method was insufficiently described in the paper. Still, the code looks like it should allow full reproducibility, which is great.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Clarity

    • Writing and grammar are generally fine. Poor clarity is mainly a result of missing details or descriptions.
    • eq 4: is T actually T_W?
    • identify ‘PE’ as ‘positional embedding’ in table 2 caption
    • “We split the features to h different partitions and shift them by j locations according to the specified axis. This helps us create random windows introducing locality along an axis.”
      • What does this mean?
      • How do you determine j? Is j different for each partition?
      • Don’t you lose much of the info when shifting by a big j?
      • [looking at the code, it seems you roll the map rather than shifting it (which would presumably involve zero-filling)]
    • “To tokenize, we first use a kernel size of 3 and change the number of channels to E, where E is the embedding dimension (number of tokens) which is a hyperparameter.”
      • What does this mean?
      • What is the full tokenization? Is it just a single conv layer?
      • [looking at the code, it seems to be so - this must be clear in the paper]
    • What kind of skip connections do you use? This is not mentioned in the paper; concat requires many more parameters than sum [looks like it’s sum in the code].

    Positional encoding, shifting, and tokenization

    • [25] encodes position info with a 3x3 conv on 4x4 patches by relying on the zero-padding; there is no such padding nor any patches here, so why would the conv give a positional encoding?
    • How does shifting give a positional encoding?
    • Is shifting of feature maps just having a regularization effect?
    • It’s not clear that shifting helps or how it helps. The performance improvement is marginal so shifting should be tested on more, larger public segmentation tasks. The benefit of regularization also decreases with training set size so a large training set would be useful.
    • The ‘tokenization’ method is never defined (the explanation is unclear and insufficient).
    • While words like ‘transformers’ and ‘tokens’ are in vogue, they seem ill-applied here, where no image patches are extracted; rather, performance improvements seem to come from decoupling convolution from feature mixing: the conv is depth-wise and feature channels are mixed by a fully connected layer (see the sketch after this list).
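To make the decoupling mentioned in the last point concrete, here is a minimal, hypothetical sketch (not the paper's exact block): a depth-wise conv handles each channel's spatial neighborhood, and a 1x1 conv, equivalent to a per-pixel fully connected layer, mixes the channels.

```python
import torch
import torch.nn as nn

class DecoupledMix(nn.Module):
    """Spatial and channel mixing decoupled, as the reviewer describes.

    Hypothetical illustration: the depth-wise conv sees each channel's
    spatial neighborhood independently, while the 1x1 conv (a per-pixel
    fully connected layer) mixes information across channels.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)  # depth-wise
        self.channel = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.channel(self.spatial(x))

x = torch.randn(1, 32, 64, 64)
print(DecoupledMix(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```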

    Validation of results

    • How does performance vary across repetitions of model training?
    • How was statistical significance evaluated? This is not described.
    • There is no mention of or comparison to other fast segmentation methods: ENet, FSSNet, FastSCNN, Squeeze U-Net, C-Unet, etc, etc.
    • It’s strange that all methods achieve almost the same performance on MoNuSeg and RITE – what about datasets where recent methods improve on older ones?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A high performance segmentation method with low computational complexity is a useful contribution; however, this work is not validated against other low computational complexity segmentation methods. Furthermore, many details of the method are not described and the feature map shifting innovation, while novel, is not well motivated or well tested.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Thank you for providing an anonymous version of the code and clarifying some points. The following points could use more work:

    1. Shifted windows in a Swin transformer refer to the overlap of each window with multiple windows in the preceding layer. This allows focusing self-attention within each window (making it tractable) while ultimately sharing this information across the whole image, since the final layer contains a window that accumulates information from all prior windows. In the code provided with this work, there are no windows; entire feature maps are shifted instead. This does not introduce “window locality”. The comparison of the shifted MLP to the Swin transformer, regarding shifted windows and locality, appears to be a false equivalence. The phrase “inducing locality” is imprecise, hand-wavy, and unexplained. The authors acknowledge that their shifting strategy may have a regularization effect but do not provide experimental or argumentative support for the claim that shifting in the MLP induces locality. Furthermore, figure 3 remains uninformative and misleading (the feature map is rolled, not shifted with zero-filling), and shifting is still not explained in a way that matches the PyTorch code (see the sketch after this list).

    2. “It can be seen that UNeXt is still the fastest network giving a competitive performance when compared with above methods.” – It appears that MobileNetv2 is faster (2.57 ms vs 25 ms) while achieving the same or better performance (80.65 F1 vs 79.37 F1).

    3. It is still unclear how the p-value was computed. Did the authors consider the variance across images or across multiple experimental runs? Did they use a t-test? Any correction methods?
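To make the contrast in point 1 concrete, here is a rough, hypothetical PyTorch sketch (illustrative shapes, taken from neither codebase): rolling an entire feature map versus Swin-style partitioning into windows that a local operation then acts on.

```python
import torch

x = torch.randn(1, 32, 64, 64)  # B, C, H, W feature map

# What the reviewed code appears to do: cyclically roll the whole
# map along one spatial axis; no windows are ever formed.
x_rolled = torch.roll(x, shifts=5, dims=3)

# What Swin does (simplified): partition the (shifted) map into fixed
# windows and compute self-attention *within* each window; the
# locality comes from that per-window step, which has no analogue in
# the shifted MLP.
win = 8
windows = (x_rolled
           .unfold(2, win, win)    # split H into 64/8 = 8 windows
           .unfold(3, win, win))   # split W into 8 windows
print(windows.shape)  # (1, 32, 8, 8, 8, 8): per-window (8, 8) blocks
```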



Review #2

  • Please describe the contribution of the paper

    The authors propose a method of modifying U-Net with techniques inspired by state-of-the-art MLP-Mixer and token-mixing work. These techniques allow the authors' method to achieve state-of-the-art results on 4 datasets while using only a small fraction of the parameters and runtime.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method appears to be novel, pulling in state-of-the-art techniques from ViT and MLP-Mixer papers, which have previously focused mostly on classification.

    2. The motivation is well founded: being able to maintain or slightly improve performance while dramatically reducing runtime and memory is essential in many applications.

    3. The authors' proposed approach achieves slightly improved results over the state of the art, but, more impressively, it does so with a fraction of the complexity (space and time). Not only are there practical benefits to this; it also shows potential theoretical benefits, where strong features can be learned without the massive parameter counts of similarly performing networks.

    4. The experiments and ablations appear to be thorough and show the contribution of each proposed novelty. Further, the authors' splitting of the data and inclusion of error bars were much appreciated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors state that “It is worthy to note that we experimented with MLP- Mixer as encoder and a normal convolutional decoder. The performance was not optimal for segmentation and it was still heavy with around 11 M parameters.” Although MLP-Mixer is not a segmentation work, there is enough similarity that the authors felt justified in adding an entire paragraph, “Difference from MLP-Mixer”. Given that, the authors should include these results in the main table rather than the vague statement that its “performance was not optimal.”

    2. It would be worthwhile to move the two additional dataset experiments from the supplementary material to the main paper. The introduction can be cut down somewhat; a bit more space is spent on motivation than is really necessary.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reviewer agrees with the checklist and is happy with the reproducibility of the paper overall.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Typo at the end of section 2: “…across the embedding dimension H… In our experiments, we set H to 768”. H is used for height; earlier in the section it was stated that “E” is the embedding dimension.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. State of the art performance at a fraction of the cost.

    2. Some novelty: although mostly pulling in state-of-the-art methods from vision classification works (ViT and MLP-Mixer), the authors use them in new ways and add some novelties of their own.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Based on the feedback from other reviewers and comments from the authors, I will lower my rating slightly from strong accept to accept. Some of the motivation for the introduced techniques does seem a little shaky, but I still stand in support of this paper being published. The results speak for themselves, and other works across the computer vision community have looked into pixel-shuffling techniques. I don’t think the fact that we don’t yet fully understand why all the pieces of this work perform as they do is justification to reject the paper. I might be in the minority here, but I vote for acceptance because I genuinely think the community can benefit from seeing this work.



Review #3

  • Please describe the contribution of the paper

    The authors propose to adopt an MLP-based network and combine it with a popular ConvNet design to achieve faster medical image segmentation with a lower computational burden on mobile devices.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The efficiency improvement is satisfactory.

    • The idea of combining ConvNet and MLP-based Network is interesting.

    • The application is of broad interests (i.e., point-of-care applications).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Several similar ideas adopting MLP-based networks have recently been proposed and are under heated discussion; thus, the unique novelty or technical contribution made in this work should be strengthened and emphasised.

    • A comparison with counterparts from the MobileNet family is missing; these are an important milestone in the domain of edge/mobile computing. Adding a comparison with them, in terms of both efficiency and effectiveness, would be more convincing. (Some available resources that may help: https://coral.ai/models/semantic-segmentation/)

    • Cross-validation (e.g., 3-fold or 5-fold) is suggested. The current experimental setup is specified as “We perform a 80-20 random split thrice across the dataset and report the mean and variance”. However, the variance of the proposed method (within Tbl. 1) looks somewhat higher than that of the others, especially when compared with the competitive ones. There is no doubt regarding the effort paid to reducing the computational burden, but it would be very useful and practical to the community if a statistically solid benchmark could be built.

    • Qualitative analysis on more popular medical image segmentation datasets, such as MoNuSeg, is preferred. The qualitative analysis in Fig. 5 looks rather uninformative; honestly, the claimed superiority of the proposed method cannot be discerned from it, which suggests that more popular and more competitive medical image segmentation datasets should be considered.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors promised to release the code and pre-trained models after the review process.

    The dataset split (80:20, split thrice) is also necessary for reproducibility; however, N-fold cross-validation is still preferred.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Typo in Page 5: “more smoother”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See main weaknesses above.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for preparing the rebuttal; they partially addressed my concerns. I’d like to change my rating to “weak accept”. Please revise and improve your manuscript carefully as discussed and promised.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes a UNet-like architecture that is very efficient in computational complexity (and inference time) without sacrificing performance on segmentation tasks. The reviewers agree that this work has merit, but they do not support presenting it at MICCAI for two main reasons. First, the clarity of the paper is relatively low, which hinders the reproducibility of this work. Second, the validation results are limited. I encourage the authors to address all reviews in their revision.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We sincerely thank the reviewers for their valuable feedback. In what follows we provide clarifications of the points raised by the reviewers.

1) Reproducibility (Meta-Reviewer, R1, R3): Link to anonymous code: https://anonymous.4open.science/r/UNeXt-anony-5477/README.md. To answer R3’s concern, we also release the random seeds we used, to ensure reproducibility of the dataset splits.

2) Clarity (Meta-Reviewer, R1):

i) T in Eq. 4 represents T_W.

ii) By creating h partitions and shifting them by j locations, we mean that we individually consider the partitions along a selected axis and shift each by a certain number of pixels. For example, while shifting across the height, there are h = W partitions, where W is the width (the number of columns), and each column is shifted by j pixels downwards. Similarly, while shifting across the width, there are h = H partitions, where H is the height (the number of rows), and each row is shifted rightward.

iii) j is fixed at 5 pixels throughout all operations, so we do not lose much information while shifting.

iv) For tokenizing, we use a convolution layer similar to [25]. The conv layer changes the number of channels to E = 768 and we average across the other dimensions, resulting in 768 tokens.

v) We use addition for skip connections, as concatenation adds more parameters.
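For concreteness, a minimal PyTorch sketch of points ii)-v) follows. This is a hedged reading, not the released implementation: it follows the rebuttal's wording plus Reviewer #1's observation that the map is rolled (wrapped around) rather than shifted with zero-filling, and it takes the common reading of tokens as spatial positions with E-dimensional embeddings; the channel grouping and exact token layout in the actual code may differ.

```python
import torch
import torch.nn as nn

def shift(x, axis, j=5):
    # ii)/iii): rolling along H (axis=2) moves every column down by
    # j pixels; rolling along W (axis=3) moves every row rightward.
    return torch.roll(x, shifts=j, dims=axis)

class Tokenize(nn.Module):
    # iv): tokenization as a single 3x3 conv that projects the feature
    # map to E channels; the output is flattened into a token sequence.
    def __init__(self, in_ch, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.proj(x)                     # B, E, H, W
        return x.flatten(2).transpose(1, 2)  # B, H*W, E

feat = torch.randn(1, 32, 16, 16)
tokens = Tokenize(32)(shift(feat, axis=2))   # (1, 256, 768)

# v): skip connections use addition, which keeps the channel count
# unchanged (concatenation would double it and add parameters).
dec, enc = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
fused = dec + enc
```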

3) Validation of Results (Meta-Reviewer, R1, R3):

i) We follow 3-fold validation with an 80:20 random split. To answer R1’s concern, the performance varies across repetitions because the train:test splits are different each time.

ii) As R3 points out, the variance currently reported for UNeXt is slightly higher than that of the previous SOTA, but please note that our mean performance is still very competitive even if we consider individual runs. We also want to emphasize that our #params, #FLOPs, and inference time are far lower than those of the previous SOTA.

iii) The p-value was calculated based on the F1 score of our method, with the null hypothesis being the F1 score of the other methods. The p-values were always less than 10^-5, making our observation statistically significant.
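For reference, the rebuttal does not specify the test behind these p-values (a point Reviewer #1 raises again post-rebuttal). Below is a hypothetical sketch of one plausible reading, a paired t-test over per-image F1 scores, with placeholder data since the per-image scores were not released.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-image F1 scores; the actual per-image values were
# not released, so these are placeholders for illustration only.
f1_unext = rng.normal(0.80, 0.05, size=130)
f1_baseline = rng.normal(0.76, 0.05, size=130)

# One plausible reading of "null hypothesis as F1 score of other
# methods": a paired t-test over per-image scores on the same test set.
t_stat, p_value = stats.ttest_rel(f1_unext, f1_baseline)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```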

4) Validation against other networks (Meta-Reviewer, R1, R3): In the paper, we mainly compared against other medical imaging methods. As per the reviewers’ suggestion, we add comparisons with other low-complexity methods from computer vision: FastSCNN, ENet, and MobileNet. The numbers below are reported as GFLOPs / Params (M) / Inf. speed (ms) / F1 / IoU, with F1 and IoU on the BUSI dataset.

FastSCNN:    2.17 / 1.14 /  6.00 / 70.14 ± 0.64 / 54.98 ± 1.21
MobileNetv2: 7.64 / 6.63 /  2.57 / 80.65 ± 0.34 / 68.95 ± 0.46
ENet:        3.83 / 0.37 / 26.99 / 79.85 ± 1.02 / 67.14 ± 0.85

It can be seen that UNeXt is still the fastest network while giving competitive performance compared with the above methods.

5) Motivation for shifting (Meta-Reviewer, R1): Please note that the shifting operation is mainly introduced to induce locality in the network. As explained in the Shifted MLP section (page 4), adding locality helps extract local features in an otherwise global network [5]. Shifting also adds a regularization effect, but the main idea is to induce window-based attention [5], as visualized in Fig. 3.

6) Positional encoding (R1): We would like to clarify that we did not claim that shifting provides positional encoding. The DWConv layer in the Shifted MLP block is responsible for encoding the positional information. We use zero padding in the DWConv, as in [25], to learn it implicitly. Note that in [25] the input to the DWConv is always in feature space (the Overlap Patch Embedding is a conv layer), similar to ours.
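As a small illustration of the claim about [25] (hypothetical dimensions, not the authors' exact layer): with zero padding, depth-wise conv outputs near the borders are computed partly from padded zeros, giving them a position-dependent signature from which absolute position can be inferred implicitly.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the implicit positional encoding argument:
# the zero padding makes border outputs distinguishable from interior
# ones, so position information leaks into the features without any
# explicit positional embedding.
dwconv = nn.Conv2d(768, 768, kernel_size=3, padding=1, groups=768)
tokens_as_map = torch.randn(1, 768, 14, 14)  # B, E, H, W
out = dwconv(tokens_as_map)
```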

7) MLP-Mixer (R2, R3): Based on R2’s suggestion, we will add its performance to the main table. To emphasize the technical novelty, we request R3 to read the “Difference from MLP-Mixer” section on page 8, where we explain how we differ from previous MLP-based networks.

8) Qualitative analysis (R3): We will add qualitative comparison from MoNuSeg in the final version.

9) Other datasets (R2): We will move those results to main paper from supplementary.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I thank the authors for their response, which clarifies most issues. I suggest that the authors revise the paper according to the comments. Also, please clarify the newly added results for MobileNetv2: is its inference time really 2.57 ms, compared to 25 ms for the proposed method? This seems odd, since the GFLOPs of MobileNetv2 and the proposed method are 7.6 and 0.57, respectively.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper received mixed recommendations before and after the rebuttal. However, the rebuttal adequately addresses most concerns, resulting in two final acceptance recommendations. The AC concurs that the quality of results and the algorithmic efficiency of the approach are of interest to the MICCAI community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a fast U-Net-like network named UNeXt and demonstrates the efficiency of the method by comparing to other SOTA segmentation algorithms. The authors submitted a strong rebuttal. Two reviewers commented that the rebuttal helped them understand the strengths of the paper better. One reviewer who previously recommended weak reject changed their rating to weak accept post rebuttal. Considering the importance and usefulness of the proposed approach (making segmentation faster while keeping accuracy high), the paper should be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4


