Authors

Jun Shi, Hongyu Kan, Shulan Ruan, Ziqi Zhu, Minfan Zhao, Liang Qiao, Zhaohui Wang, Hong An, Xudong Xue

Abstract

Recently, deep learning methods have been widely used for tumor segmentation of multimodal medical images with promising results. However, most existing methods are limited by insufficient representational ability, specific modality number and high computational complexity. In this paper, we propose a hybrid densely connected network for tumor segmentation, named H-DenseFormer, which combines the representational power of the Convolutional Neural Network (CNN) and the Transformer structures. Specifically, H-DenseFormer integrates a Transformer-based Multi-path Parallel Embedding (MPE) module that can take an arbitrary number of modalities as input to extract the fusion features from different modalities. Then, the multimodal fusion features are delivered to different levels of the encoder to enhance multimodal learning representation. Besides, we design a lightweight Densely Connected Transformer (DCT) block to replace the standard Transformer block, thus significantly reducing computational complexity. We conduct extensive experiments on two public multimodal datasets, HECKTOR21 and PI-CAI22. The experimental results show that our proposed method outperforms the existing state-of-the-art methods while having lower computational complexity. The source code is available at https://github.com/shijun18/H-DenseFormer.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_66

SharedIt: https://rdcu.be/dnwEj

Link to the code repository

https://github.com/shijun18/H-DenseFormer

Link to the dataset(s)

PI-CAI22: https://pi-cai.grand-challenge.org/

HECKTOR21: https://www.aicrowd.com/challenges/miccai-2021-hecktor

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes a novel densely connected transformer, which requires fewer parameters than conventional transformer architectures. This block is utilized in a multi-path fashion! The proposed method is evaluated on one public 2D and one public 3D dataset. It is compared with state-of-the-art methods, but there is a large gap between them in the number of parameters.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper proposed a novel densely connected transformer, which requires fewer parameters compared to conventional transformer architectures.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The evaluation is performed against architectures with a significantly higher number of parameters (up to 95.76M), while the authors themselves suspect that the proposed method overfits at 4.02M parameters.
2. The experimental settings are not precise (see 8, 9)
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
1. The training and validation losses of the best weights for each method in Table 2 are not reported.
2. The random seed is not reported.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Section 3.4 mentions that the authors suspect overfitting when the number of parameters is increased from 3.64M to 4.03M. Has it been verified that the other methods in Table 2, which have between 12M and 95M parameters, are not overfitted? It would be more informative if at least one more experiment per dataset, using a model with a similar number of parameters, were added to Table 2. For instance, decrease the number of feature maps in one of the available architectures, such as TransUNet. Please note that the number of GFLOPs would also decrease significantly.
2. It is not clear what w/o MPE in Table 4 indicates! Is the entire MPE block removed, and only the UNet remains? Or instead of multi-path, all modalities are merged in the input?
3. In Tables 2-4, it is not reported whether the results are significantly different!
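
The suggestion in point 1 relies on how parameter count scales with channel width. The following is a minimal sketch, assuming a plain chain of 3x3x3 convolutions with hypothetical channel widths (it is not the layer configuration of TransUNet or of any method in Table 2). Because each layer's weights scale with the product of input and output channels, halving every width cuts the total roughly fourfold.

```python
def conv_params(c_in, c_out, kernel=3, dims=3):
    """Weights plus biases of one convolution layer."""
    return c_in * c_out * kernel ** dims + c_out


def stack_params(widths, in_channels=2):
    """Total parameters of a chain of convolutions with the given output widths."""
    total, c_in = 0, in_channels
    for c_out in widths:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total


# Hypothetical widths, chosen only for illustration.
full = [32, 64, 128, 256]
half = [w // 2 for w in full]

full_count = stack_params(full)
half_count = stack_params(half)
ratio = full_count / half_count
print(full_count, half_count, round(ratio, 2))
```

The ratio falls slightly short of 4 because the first layer's input channels (here two modalities) and the bias terms do not shrink quadratically with the width.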
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method is novel, even though the evaluation section could be clearer.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

The evaluation is performed against architectures with a significantly higher number of parameters (up to 95.76M), while Section 3.4 mentions that the authors suspect overfitting when the number of parameters is increased from 3.64M to 4.03M. This concern is not addressed in the rebuttal!

Review #2

Please describe the contribution of the paper

The authors propose H-DenseFormer, a new efficient multimodal tumour segmentation architecture. H-DenseFormer uses a U-shaped network as the backbone, takes channel-concatenated (2D or 3D) multimodal images as input, and finally outputs the segmentation results of each modality. Unlike the basic U-shaped network, the authors propose a Transformer-based Multi-path Parallel Embedding (MPE) module that can extract rich features from images of different modalities. The features of the different modalities generated by MPE are then concatenated and further fused with the features of each encoder stage in the U-shaped network through layer-by-layer upsampling operations. To reduce the computational complexity of using the Transformer in MPE, the authors also use a dense connection strategy. The experimental results on two publicly available multimodal datasets show that the authors’ method can also outperform some existing advanced methods with low computational complexity.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1. The authors design a Transformer module based on dense connection, which can reduce the computational complexity while maintaining the feature extraction ability of the Transformer, taking into account the balance between efficiency and performance.
2. The authors propose the Transformer-based Multi-path Parallel Embedding (MPE) module to extract the semantic features of different modal images, which is complementary to the backbone model based on input fusion.
3. The paper is well written and the presentation is clear. The method is explained in sufficient detail and the evaluation is appropriate. The results compare favorably to a range of previous methods.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The novelty of this method is questionable because similar network structures, namely Transformer branch combined with U-shaped network, have been used in previous single-modal medical image segmentation tasks (https://ieeexplore.ieee.org/document/9871945). In addition, the Deep Supervision Loss was also used in this work.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The model is described in sufficient detail, and the authors outline their experimental procedure clearly. The authors use two public datasets and provide code repository links.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Although the authors have conducted extensive comparative experiments, the implementation of the compared methods is not described. I recommend that the authors explain more about the settings of the compared methods. For example, how do the authors apply a single-modality method (ITUNet, 2022) to the task of multimodal tumour segmentation?

Typo on page 6: “3D U-Net, [7] UNETR, [16]” -> “3D U-Net [7], UNETR [16]”
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Although this method is similar to the previous single-modal segmentation work in the overall structure of the network, it achieves advanced performance in multi-modal segmentation tasks and has less computational complexity, which is favourable to the presented method.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

They present a hybrid architecture for tumor segmentation that integrates a Transformer-based Multi-path Parallel Embedding (MPE) module that is able to extract fusion features from different image modalities. They present a Densely Connected Transformer block to replace the standard Transformer block, reducing the computation costs. MPE assigns an independent encoding path to each modality, then merges the semantic features of all paths and feeds them to the encoder of the segmentation network. The results on two public multimodal datasets show the effectiveness of the model, which also has a lower computational complexity than its competitors.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written and easy to follow. All sections are well explained. Each module of the network is very well explained and presented, and the importance of each module is demonstrated with ablation studies. The performance is compared with different SOTA methods and shown to be better.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

What hyperparameters were used to train the state-of-the-art methods? Was a search done for the best hyperparameters?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

They shared the code and also implementation details are listed in the paper.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

It would be appreciated if the authors stated whether the comparison methods were trained by them and whether a hyperparameter search was carried out for those methods.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is very well written and easy to follow. The architecture is well explained and the results are good.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

6
[Post rebuttal] Please justify your decision

The paper is well-structured and effectively communicates its findings. While it would be helpful to provide further clarification on the evaluation of other methods, I am confident in my decision to accept the paper.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper proposed a densely connected transformer for multi-modal medical image segmentation in a multi-path embedding mode. The design of DCT is reasonable and can reduce the computational complexity. With transformer design, the number of parameters will also be largely reduced.

However, I have several concerns about this paper.
1. Why not use the BRATS dataset, which also has several modalities and is very widely used in multi-modal segmentation?
2. In Table 2, why is UNETR so much worse than the plain 3D U-Net on HECKTOR21?
3. In Table 2, the 2D H-DenseFormer is only slightly better than U-Net++, which is no longer a SOTA method. So I doubt whether this paper really compares against the SOTA medical image segmentation methods.
4. I am not sure why the multi-path embedding design is good. If it is indeed better than early fusion on the datasets used in your experiments, is it also a better solution than early fusion on other multi-modal datasets (for example, BRATS)?
5. How did you obtain the results of the comparison methods in Table 2?

Author Feedback

Meta Reviewer: Q1: About the BRATS dataset. A1: We selected the datasets based on the following considerations. First, they should cover as many modalities as possible, such as CT or PET, not just multi-sequence MR. Second, the datasets should be able to evaluate both the 2D and 3D variants of the proposed method. PI-CAI22 is more suitable for 2D approaches and was chosen due to the small number of slices in each sample. Moreover, compared with BRATS, HECKTOR21 and PI-CAI22 are more challenging, and there is much remaining room for performance improvement on them. As you suggested, we think it is better to conduct extended experiments on BRATS for further evaluation. Q2: Performance of UNETR and 3D U-Net on HECKTOR21. A2: We ran UNETR on the dataset with its open-source code and report the average scores over the 5 folds. It shows worse performance than 3D U-Net. First, HECKTOR21 is only half the size of the dataset used in the original UNETR paper. Moreover, extensive studies (e.g., nnU-Net) have also demonstrated the better effectiveness and generalization capability of 3D U-Net compared with some models designed for specific tasks or scenarios. Q3: Comparison methods on PI-CAI22. A3: In terms of accuracy and efficiency metrics, different methods may perform differently on different datasets. U-Net++ was proposed earlier, but it still performs well on our datasets due to its powerful modeling and generalization abilities. PI-CAI22 is a challenge dataset newly released in 2022. We also selected some SOTA methods from the leaderboard for comparison, such as ITUNet (no external dataset). Q4: About the MPE design. A4: Early fusion entangles low-level multimodal features during feature extraction and suppresses the feature representation of the different modalities. In contrast, MPE decouples the feature representations of the different modalities and delays the fusion of multimodal features. We think such decoupling can help to extract higher-quality multimodal features, as demonstrated by many similar studies, such as ref [5,10,15]. Experimental results on our test sets show the effectiveness of the MPE design. In future work, we will evaluate the proposed method on additional datasets (e.g., BRATS) as you suggest.
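
For readers unfamiliar with the 5-fold averaging mentioned in A2, here is a minimal sketch of the bookkeeping. The function names, the seed, the sample count, and the `dummy_evaluate` placeholder are all assumptions for illustration; this is not the authors' training pipeline.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices once and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average a score over k train/validation splits."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, val_idx))
    return sum(scores) / k, scores


def dummy_evaluate(train_idx, val_idx):
    # Placeholder for "train on train_idx, return a validation score on val_idx".
    return len(val_idx) / (len(train_idx) + len(val_idx))


mean_score, fold_scores = cross_validate(23, dummy_evaluate)
print(mean_score, fold_scores)
```

Fixing and reporting the seed used for the split makes the fold assignment reproducible across methods.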

Reviewer 1: Q1: About additional experiments. R1: Thanks for your comments; we will follow your suggestion to conduct additional experiments to make our paper more informative. Q2: About ablation study on MPE. R2: Sorry for the confusion; the w/o MPE in Table 4 indicates that the MPE keeps only one path, and its input is the multimodal image concatenated in the channel dimension. We will describe it better in the revision. Q3: About the significance test. R3: We will conduct significance test experiments and report the results in the revision.
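
One way the significance test promised in R3 could be carried out is a paired sign-flip permutation test on matched scores. The sketch below is an assumption about a suitable test, not the procedure the authors used, and the Dice values are hypothetical rather than taken from the paper.

```python
from itertools import product


def paired_permutation_p(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Under the null hypothesis the sign of each paired difference is arbitrary,
    # so enumerate all 2**n sign assignments (feasible for small n only).
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        hits += stat >= observed - 1e-12  # small tolerance for float rounding
        total += 1
    return hits / total


# Hypothetical per-fold Dice scores, not taken from the paper.
method_a = [0.74, 0.71, 0.77, 0.73, 0.75]
method_b = [0.72, 0.70, 0.74, 0.72, 0.73]
p_value = paired_permutation_p(method_a, method_b)
print(f"p = {p_value:.4f}")
```

With n paired samples the smallest attainable two-sided p-value is 2 / 2**n, so five per-fold means can never reach the 0.05 level; per-case scores give the test more resolution.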

Reviewer 2: Q1: About novelty. A1: We have also noticed ITUNet. Our approach differs significantly from it. First, our structure is designed for multimodal images. Second, in addition to segmentation performance, we also focus on improving computational efficiency; for example, we develop the DCT to reduce the computational complexity significantly. Finally, the deep-supervision mechanism has become a standard loss calculation module in segmentation models, but its effectiveness mainly depends on the specific loss function design.

Meta Reviewer, Reviewer 2 and 3: Q: About the settings of the comparison methods. A: First, for single-modality methods (e.g., ITUNet), we concatenate the multimodal images in the channel dimension as network input in the data processing or augmentation stage. Second, for fairness, all comparison methods are trained from scratch with their open-source code, following their original configurations. Our proposed methods use empirical hyperparameter settings, as described in Section 3.2. Furthermore, neither the proposed nor the comparison methods use a dedicated hyperparameter search. We think that if a hyperparameter search were used, all methods might gain different degrees of performance improvement.
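
The channel-dimension concatenation described in this answer can be illustrated with plain nested lists. This is a sketch under the assumption that the modalities are already co-registered to the same shape; the function name and the tiny CT and PET slices are hypothetical.

```python
def concat_channels(modalities):
    """Stack per-modality images along a new leading channel axis.

    Each modality is a 2D image given as a list of rows; all must share a shape.
    The result is indexed as [channel][row][column].
    """
    shape = (len(modalities[0]), len(modalities[0][0]))
    for image in modalities:
        if (len(image), len(image[0])) != shape:
            raise ValueError("modalities must be co-registered to the same shape")
    # Copy the rows so the fused input does not alias the source images.
    return [[row[:] for row in image] for image in modalities]


# Hypothetical 2x3 CT and PET slices.
ct = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
pet = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
fused_input = concat_channels([ct, pet])
print(len(fused_input), len(fused_input[0]), len(fused_input[0][0]))
```

A single-modality network then only needs its first layer to accept as many input channels as there are modalities.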

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The rebuttal has addressed part of my concerns. The concerns about the BRATS dataset and about the performance of UNETR and 3D U-Net on HECKTOR21 are not well addressed. Also, a very important concern from R1 about the number of parameters and overfitting remains after the rebuttal, which turned R1's rating negative. Overall, the paper still has several important unaddressed issues that prevent us from really understanding the method and experiments; thus, I suggest rejecting it.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This paper presents a new design of hybrid densely connected network for efficient multimodal tumor segmentation. Overall the paper is well presented and the proposed framework provides a lightweight and novel solution with promising performance. The rebuttal has provided some analysis regarding the novelty, and the clarifications on the experimental settings. This paper can be further improved by including more analysis regarding the computational cost of the proposed modules.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Overall, the paper has weaknesses in the presentation, the experimental design and the parameters, but it has some merit. This makes it a borderline paper.

Weakness: In terms of presentation, the authors should compare with ITUNet and highlight the differences from that method. The parameters should be carefully tuned, for example, following previous works. The evaluation should include the most commonly used datasets, such as BRATS. The authors use the more challenging ones (which is OK) but should not simply ignore BRATS.

Strength: The MPE appears to be a new design that might be interesting to the community. Considering both the strengths and weaknesses of the paper as well as its overall ranking, I lean toward accepting this paper.

H-DenseFormer: An Efficient Hybrid Densely Connected Transformer for Multimodal Tumor Segmentation