Authors

Saikat Roy, Gregor Koehler, Constantin Ulrich, Michael Baumgartner, Jens Petersen, Fabian Isensee, Paul F. Jäger, Klaus H. Maier-Hein

Abstract

There has been exploding interest in embracing Transformer-based architectures for medical image segmentation. However, the lack of large-scale annotated medical datasets makes achieving performance equivalent to that on natural images challenging. Convolutional networks, in contrast, have higher inductive biases and, consequently, are easily trainable to high performance. Recently, the ConvNeXt architecture attempted to modernize the standard ConvNet by mirroring Transformer blocks. In this work, we improve upon this to design a modernized and scalable convolutional architecture customized to the challenges of data-scarce medical settings. We introduce MedNeXt, a Transformer-inspired large kernel segmentation network which introduces: 1) A fully ConvNeXt 3D Encoder-Decoder Network for medical image segmentation, 2) Residual ConvNeXt up and downsampling blocks to preserve semantic richness across scales, 3) A novel technique to iteratively increase kernel sizes by upsampling small kernel networks, to prevent performance saturation on limited medical data, 4) Compound scaling at multiple levels (depth, width, kernel size) of MedNeXt. This leads to state-of-the-art performance on 4 tasks on CT and MRI modalities and varying dataset sizes, representing a modernized deep architecture for medical image segmentation. Our code is available here: https://github.com/MIC-DKFZ/MedNeXt

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_39

SharedIt: https://rdcu.be/dnwDN

Link to the code repository

https://github.com/MIC-DKFZ/MedNeXt

Link to the dataset(s)

https://www.synapse.org/#!Synapse:syn3193805/wiki/89480

https://amos22.grand-challenge.org/

https://kits19.grand-challenge.org/

https://www.synapse.org/#!Synapse:syn27046444/wiki/616571


Reviews

Review #2

  • Please describe the contribution of the paper

    In this paper, a new network architecture for medical image segmentation is introduced. The design integrates elements from both ConvNeXt and 3D-UX-Net in both the encoder and decoder components. Additionally, a novel kernel called UpKern is proposed to transfer knowledge from small kernel networks to larger kernel networks. The compound scaling strategy is also employed to scale the network depth, channel size, and kernel size simultaneously.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - The paper is well-written and well-organized.
    - The novelty of the proposed network lies in its adaptation of ideas from computer vision models and their application to medical imaging. Specifically, the model enhances ConvNets by increasing their scalability (i.e., parameter size) and receptive field size.
    - The proposed network exhibits strong performance, surpassing not only strong baseline methods in cross-validation but also topping the challenge leaderboard. This demonstrates the effectiveness of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The paper contains some vague statements. Firstly, the primary objective of ConvNeXt was not to enable ConvNets to capture long-range spatial correspondence. Rather, the authors of ConvNeXt attributed the success of Transformers to their superior scaling properties, which can be achieved through large-scale data and models. Therefore, their motivation was to augment ConvNets with scaling behavior similar to Transformers using advanced convolution operations and modules. However, in this paper, only the parameter size was increased, not the dataset size.
    - Additionally, the authors implied that the strong performance of the proposed model was due to its ability to capture long-range spatial representation. However, the effectiveness of long-range spatial representation is still a topic of debate, as discussed in [1, 2].
    - Moreover, although the authors increased the kernel size of the convolutional operations, it is unclear whether the effective receptive field size truly increased. The paper could benefit from visualizing effective receptive fields.
    - The specifics about the pretraining of the network with small kernels were not provided. Did the authors employ a self-supervised pretraining approach? Alternatively, were the small and large kernel networks trained in a progressive manner?
    - While the proposed method demonstrated promising performance, the comparison presented in Table 2 may not have been entirely fair. It should be noted that SwinUNETR and UNETR each used a unique pipeline, different from that of nnUNet. As a result, retraining them with nnUNet’s pipeline might lead to a somewhat biased comparison. This is supported by the BTCV challenge leaderboard, where both UNETR and SwinUNETR achieved test dice scores of approximately 0.9, while nnUNet and the proposed method scored around 0.88. However, the proposed method exceeded these baseline scores by more than 0.4 when using the authors’ data-split and training pipeline. This raises concerns about the fairness of the comparison.

    1. Li, Jun, et al. “Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives.” Medical image analysis (2023): 102762.
    2. Park, Namuk, and Songkuk Kim. “How Do Vision Transformers Work?.” International Conference on Learning Representations.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please provide feedback on the questions listed above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this paper demonstrates good quality. However, there remain a few questions that require further clarification.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A
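
Review #2's point about kernel size versus effective receptive field can be made concrete with the standard recurrence for the theoretical receptive field of stacked convolutions. The sketch below is only illustrative: the function name and the two layer layouts are hypothetical and are not the MedNeXt configuration. It shows that the theoretical receptive field grows with kernel size by construction, which is why a separate empirical measurement would be needed to answer the reviewer's question about the effective receptive field.

```python
def theoretical_receptive_field(layers):
    """Theoretical receptive field along one axis, in input voxels.

    `layers` is a sequence of (kernel_size, stride) pairs applied in order.
    """
    rf = 1    # a single input voxel sees itself
    jump = 1  # spacing, in input voxels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical layouts for illustration only (NOT the MedNeXt design):
# two convolutions, a strided downsampling step, two more convolutions.
small_kernel_stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
large_kernel_stack = [(5, 1), (5, 1), (2, 2), (5, 1), (5, 1)]

rf_small = theoretical_receptive_field(small_kernel_stack)
rf_large = theoretical_receptive_field(large_kernel_stack)
```

This computes only an upper bound on what each output voxel can see. It says nothing about how strongly distant voxels actually influence the output, which is the quantity the reviewer asks the authors to visualize.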



Review #3

  • Please describe the contribution of the paper

    The paper presents an innovative volumetric segmentation model called MedNeXt, which adopts a ConvNeXt architecture resembling vision transformers (ViTs). Notably, large kernels with depth-wise convolution operations are employed to emulate ViTs. The authors utilize ConvNeXt building blocks for both up and downsampling layers. To enhance the gradient flow during training and preserve contextual richness for dense segmentation, the authors introduce Residual Inverted Bottlenecks. Additionally, to optimize performance on large kernels in MedNeXt, the authors propose UpKern, inspired by Swin Transformer V2, where large kernels are initialized with trained upsampled small kernel networks. The proposed model is evaluated on four volumetric datasets to demonstrate its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The research paper is commendably well-written and effectively conveys the concepts in an easily understandable manner. Furthermore, the various architectural component choices are elucidated clearly, aiding comprehension.

    2. The proposed model is thoroughly evaluated on multiple datasets and extensively compared with relevant baselines, demonstrating a comprehensive experimental analysis.

    3. The authors have successfully integrated several techniques, including Residual Inverted Bottlenecks, UpKern, and Compound Scaling, to enhance the segmentation performance on diverse datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. My primary concern is the lack of clarity regarding utilizing Deep Supervision at each decoder layer for all the models presented in Table 2. If Deep Supervision is not applied consistently across all the models, it raises questions about the fairness of the comparison, as it is widely acknowledged that Deep Supervision generally enhances performance.

    2. The statement suggesting that using ConvNeXt blocks in 3D-UX-Net is only partial and limits their potential benefits is not clearly supported and requires further clarification.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    A more comprehensive comparison between the proposed model and the recent similar model, 3D-UX-Net, would elevate the quality of this paper from good to great. Additionally, the authors should consider updating the manuscript to provide further clarity on the concept of Deep Supervision, as this would shed light on the key architectural factors that contribute to the improved accuracy of the proposed model.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written, and the work interests the MICCAI community but has a minor weakness.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A
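
For context on Review #3's deep supervision concern, the following is a minimal sketch of how a deep-supervision loss is often combined across decoder levels. The halving weight scheme and both function names are assumptions chosen for illustration; neither the paper excerpt nor the rebuttal on this page specifies which scheme, if any, each compared model used.

```python
def deep_supervision_weights(num_outputs):
    """Normalized weights for decoder outputs, highest resolution first.

    Assumption: each lower-resolution output gets half the weight of the
    previous one, and the weights are rescaled to sum to 1.
    """
    raw = [1.0 / (2 ** level) for level in range(num_outputs)]
    total = sum(raw)
    return [w / total for w in raw]


def deep_supervision_loss(per_level_losses):
    """Weighted sum of per-level losses, ordered highest resolution first."""
    weights = deep_supervision_weights(len(per_level_losses))
    return sum(w * loss for w, loss in zip(weights, per_level_losses))
```

Because the auxiliary outputs contribute extra gradient signal during training, applying a scheme like this to only some of the compared models would make the comparison uneven, which is the consistency question the reviewer raises.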



Review #1

  • Please describe the contribution of the paper

    The work proposes a novel network architecture termed MedNeXt, a Transformer-inspired large kernel segmentation network. It is motivated by the need for network architectures that are better suited to 3D segmentation tasks on radiology datasets. The technical novelty targets performance improvement, particularly for small annotated datasets, by upsampling small kernel networks. Testing was done on 4 public datasets of varying size acquired with CT and MR; SOTA performance is claimed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strong validation: testing was performed on 4 datasets, with good performance

    Well described technical contributions

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No statistical significance tests are reported and standard deviations are missing; however, the mean of the 5-fold performance is provided, which is a modest indicator of performance

    While this work shows the promise of the proposed approach through individual components such as UpKern and Compound Scaling, it does not offer deeper insights, such as a feature-level comparison, into why these work better for the medical image segmentation task. For example, is it possible to get a more visual understanding of why these components are better? In short, the interpretability of the proposed network architecture is limited.

    While compound scaling shows improvements, it also introduces another hyper-parameter for tuning which is additional complexity for a user to deal with

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Fairly reproducible as 5-fold testing has been conducted, however errors/std have not been reported nor has statistical significance been reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Is it possible to get a more visual understanding of why the proposed components are better? How can the proposed model be interpreted in terms of which kinds of features it learns better than other network architectures?

    The computational cost of this model has not been discussed in terms of its size relative to other existing network architectures. For example, nnUNet, SwinUNETR, or other architectures could be used as references in Table 1 of the supplementary material, which should be part of the main paper, since this work is primarily about a network architecture.

    While this network architecture has been shown to be useful for segmentation tasks, there are many other tasks in the medical imaging domain, such as object detection and classification. It would be worth adding what is known about the generalizability of this model to other tasks.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical novelty; experiments on multiple datasets lead to well-established results

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A
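
Review #1 asks for standard deviations and significance testing on the 5-fold results. One way this could be reported is sketched below, using only the Python standard library. The function names are hypothetical, and the per-fold values are invented placeholders, NOT results from the paper. The paired t-statistic is one possible choice of test, not something the authors state they used.

```python
import math
import statistics


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic for two methods evaluated on the same folds."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


# Placeholder per-fold Dice values, invented for illustration only.
method_a = [0.90, 0.88, 0.91, 0.89, 0.92]
method_b = [0.88, 0.87, 0.90, 0.87, 0.89]

mean_a, std_a = summarize_folds(method_a)
t_stat = paired_t_statistic(method_a, method_b)
# With 5 folds there are 4 degrees of freedom; the two-sided 5% critical
# value of Student's t for 4 degrees of freedom is about 2.776.
```

Cross-validation folds share training data, so the independence assumption behind the t-test holds only approximately. A per-case paired test on the pooled validation cases is a common alternative.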




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a convolutional network architecture for 3D segmentation based on the findings in ConvNeXt. The ConvNeXt architectures are purely convolutional architectures that resemble the characteristics of Transformers for 2D image classification. In this paper, the authors adopt some findings from ConvNeXt and also introduce some new techniques such as UpKern for 3D segmentation. As pointed out by the reviewers, this paper is well written and the experimental results on four publicly available datasets show the strong performance of the proposed MedNeXt architecture. On the other hand, there are also concerns related to the experimental designs and missing important details.




Author Feedback

We would like to thank the reviewers and meta-reviewer for their positive impressions of our work and address the points they raised:

1) We apologize for the lack of visual illustrations of the segmentation results as pointed out by Reviewer 1 but we had hoped that the combination of volumetric (dice score) and surface (surface dice) error measures would serve as a viable alternative to highlight performance while allowing us to be economical with the limited space available.

2) Reviewer 2 rightfully points out that there are multiple facets of possible investigation into the effectiveness of these modules. Follow-up work will focus on further analysis of the architecture.

3) To answer Reviewer 2, small kernel ‘pre’-training is done in a supervised fashion with the same training schedule and loss function as the large kernel nets. The only difference is that the large kernel nets are initialized with the small kernel net; hence, we call it pre-training. This is not self-supervised pretraining as in SwinUNETR, for example.

4) As pointed out by Reviewer 2, the evaluation against UNETR and SwinUNETR raises additional challenges due to their original training pipelines. However, we felt it best to enforce a uniform framework of training, augmentation and losses on all nets and illustrate how the networks perform with all things being equal.

5) As pointed out by Reviewer 3 regarding a more comprehensive comparison against 3D-UX-Net: we do not claim that 3D-UX-Net is ineffective in any way because of its partial ConvNeXt architecture. Rather, we believe that ConvNeXt blocks help performance, and that our use of these blocks uniformly throughout the architecture gives us an advantage over their partial usage in 3D-UX-Net, which is supported by our performance.

6) As pointed out by Reviewer 3, disentangling the efficacy of deep supervision on our architecture would have been an ideal ablation. We however could not explore this in the context of this work but will keep the valuable recommendation in mind for further work in this direction.
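
The initialization described in point 3, where a large kernel network starts from a trained small kernel network, can be sketched in one dimension. This is a simplified illustration under stated assumptions: the function name `upsample_kernel_1d` is hypothetical, the linear interpolation with preserved endpoints is an assumed choice, and the real networks use 3D convolution kernels rather than the 1D list shown here.

```python
def upsample_kernel_1d(kernel, new_size):
    """Linearly interpolate a 1D kernel to `new_size` taps.

    The first and last taps are preserved; the taps in between are
    linear blends of the two nearest original taps.
    """
    old_size = len(kernel)
    if old_size < 2 or new_size < 2:
        raise ValueError("both kernel sizes must be at least 2")
    scale = (old_size - 1) / (new_size - 1)
    upsampled = []
    for i in range(new_size):
        pos = i * scale                      # position in the old kernel
        left = min(int(pos), old_size - 2)   # index of the left neighbour
        frac = pos - left                    # blend factor toward the right
        upsampled.append((1 - frac) * kernel[left] + frac * kernel[left + 1])
    return upsampled


# Stand-in for one trained 3-tap kernel (values are illustrative only).
small_kernel = [0.0, 1.0, 0.0]
large_kernel = upsample_kernel_1d(small_kernel, 5)
```

The larger kernel then serves as the starting point for ordinary supervised training with the same schedule and loss, which matches the rebuttal's statement that initialization is the only difference between the two training runs.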


