
Authors

Yao Zhang, Nanjun He, Jiawei Yang, Yuexiang Li, Dong Wei, Yawen Huang, Yang Zhang, Zhiqiang He, Yefeng Zheng

Abstract

Accurate brain tumor segmentation from Magnetic Resonance Imaging (MRI) requires joint learning of multimodal images. However, in clinical practice, it is not always possible to acquire a complete set of MRIs, and the problem of missing modalities causes severe performance degradation in existing multimodal segmentation methods. In this work, we present the first attempt to exploit the Transformer for multimodal brain tumor segmentation that is robust to any combinatorial subset of available modalities. Concretely, we propose a novel multimodal Medical Transformer (mmFormer) for incomplete multimodal learning with three main components: hybrid modality-specific encoders that bridge a convolutional encoder and an intra-modal Transformer for both local and global context modeling within each modality, an inter-modal Transformer to build and align the long-range correlations across modalities for modality-invariant features with global semantics corresponding to tumor region, and a decoder that performs a progressive up-sampling and fusion with the modality-invariant features to generate robust segmentation. Besides, auxiliary regularizers are introduced in both encoder and decoder to further enhance the model’s robustness to incomplete modalities. We conduct extensive experiments on the public BraTS 2018 dataset for brain tumor segmentation. The results demonstrate that the proposed mmFormer outperforms the state-of-the-art methods for incomplete multimodal brain tumor segmentation on almost all subsets of incomplete modalities, especially by an average 19.07% improvement of Dice on tumor segmentation with only one available modality. The source code will be publicly available after the blind review.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_11

SharedIt: https://rdcu.be/cVRyp

Link to the code repository

https://github.com/YaoZhang93/mmFormer

Link to the dataset(s)

https://www.med.upenn.edu/sbia/brats2018/data.html


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a Transformer network for multimodal medical data such as brain MRI. The presented approach is capable of handling incomplete information in the dataset, which is validated with experimental results on the BraTS dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and results are properly presented.
    2. The manuscript is focused on dealing with incomplete data which is a general issue in most of the medical datasets.
    3. The mmFormer architecture is presented well to discuss each individual module in it.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The results are shown on the BraTS 2018 dataset, whereas BraTS 2021 is also available now. The experiments could be extended accordingly.
    2. There should be a subsection discussing the complexity of the proposed mmFormer architecture in comparison with other methods.
    3. The last line of the conclusion section seems contradictory and needs rewriting (“Our method gain more improvements when more modalities are missing…”): from Table 1, the results are not the best when only one modality is present and three are missing.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method proposed is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The paper is well written and explained.
    2. The results should be presented on the BraTS 2021 dataset or another similar dataset in an extended version of the work.
    3. There should be a subsection discussing the complexity of the proposed architecture in comparison to the SOTA.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, medical data lacks information in the form of modalities, and this is a current area of research. The paper presents a new mmFormer architecture that can tackle this issue well.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose a hybrid network that combines CNNs and transformers for segmentation of brain tumors from multimodal MRI inputs with missing sequences. The network is well designed; the authors motivate the need and provide a good description of the various modules. The network was trained and evaluated on a standard dataset, enabling easy comparison with SOTA models. However, some implementation details are lacking: model size, training duration, and inference speed are not provided. The Dice coefficient was the only metric used for evaluation. Though splits similar to those in ref [21] were used, a comparison with the results in [21] is not provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model developed here addresses an important clinical need, as in many cases a complete set of all 4 MRI sequences is not acquired. The network combines the extracted features per input MRI sequence using an elegant modality-correlated encoder. Forcing the network to learn meaningful representations from the individual encoder for a specific MRI sequence using the auxiliary regularizer is novel.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No model size / number of parameters is provided, and a comparison with the Adversarial Co-training Network in ref [21] is not provided. Though ablation studies were performed to understand the contribution of various components of the model, it is hard to discern how modeling long-range interactions with transformer modules improves the segmentation (does it help with better segmentation of larger lesions, and/or does it have any effect on smaller lesions?).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Certain implementation details are missing:

    • The number of initial filters per level for the CNN encoders is not given.
    • The patch size for the intra-modal transformers is not provided; were the inner-most features of the CNN encoder just flattened?
    • What features from the encoder are forwarded as skip features? If it is the features from the CNN encoders at specific levels, how are they combined across modalities?
    • From Fig. 1, it appears as if the intra- and inter-modal transformers are used only at the bridge/innermost level. If so, what is used as skip features at every level in the decoder?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • There is some redundancy in the text in the last paragraph of introduction and beginning of the methods section. This could be shortened to provide more / clarify the implementation details.
    • Suggest writing the loss terms in terms of outputs from shared-weight decoder and convolution decoder with outputs at every level in Equation [9]. If the first term is the summation of losses from modality specific encoders + shared-weight decoder, does it imply that loss will be greater when more input modalities are available?
    • Please provide model size in terms of model parameters, memory requirements and training duration for 1000 epochs.
    • From the description in experiments and results section: ‘For a fair comparison, we use the same data split in [21] and directly reference the results’, but comparison with ACN model in [21] is not provided in Table 1. Please update Table 1 with missing information.
    • For Fig. 2, I suggest providing all input MRI sequences to help see what lesion information is available in each input modality; the lesion information could be shown as magnified insets. There is little tumor heterogeneity in the example shown; if space permits, please include another example with tumor heterogeneity.
    • Similar to Table 1, please update Table 2 with results from ACN in ref [21].
    • It is hard to interpret the results in Table 3. The drop/improvement in segmentation performance for a specific tumor type could be explained by how much of a shared representation is present across modalities, how well the model captures this shared information, and how well long-range interactions are captured by the transformer modules. Please try to dissociate these factors in the ablation studies in Table 3. Maybe use the ablation experiments with all input modalities to understand the contribution of transformers for modeling long-range interactions, and for missing modalities, group the experiments by the number of missing input modalities. This would mean representing the data in Table 3 differently, as you already have the results for the various models.
    • This might shed more light on why you see more improvement for enhancing tumor even though its size is usually smaller than the core and whole tumor. Most of the improvement could be driven by better learning of shared information rather than by modeling long-range interactions with transformers (probably not discernible from your current results; this would require a configuration without the transformers).
    • The sample size of 285 is small for large models. Following training and validation folds similar to SOTA models enables fair comparison. I suggest considering CV splits on a few subsets of input configurations to understand the variability in model generalization.

    [21] Wang, Y., Zhang, Y., Liu, Y., Lin, Z., Tian, J., Zhong, C., Shi, Z., Fan, J., He, Z.: ACN: Adversarial co-training network for brain tumor segmentation with missing modalities. In: International Conference on Medical Image Computing and Computer Assisted Intervention. pp. 410–420. Springer (2021)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Absence of implementation details and comparison with ACN in ref[21]. Looking at table 1 in ref [21], the best performance using the mmFormer (77.61, 85.78, 89.64) is marginally better than ACN (77.46, 85.18, 89.22). However, the average performance across the 15 models is slightly better for ACN model (61.21, 77.62, 85.92) compared with mmFormer (59.85, 72.97, 82.94). It would be nice to have justification in terms of model size / usability / generalizability. Not sure if comparison with Table 1 in ref [21] is valid as it is unclear if the splits in ACN were saved and used in this work or if another random splitting was performed!

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors provided more clarifications and implementation details in the rebuttal. The comparison with ACN in terms of training time and the number of models trained helps to better understand the merits of this approach.



Review #3

  • Please describe the contribution of the paper

    In this paper, authors exploit Transformer, named Multimodal Medical Transformer (mmFormer), to build a unified model for incomplete multimodal learning of brain tumor segmentation. Experimental results demonstrate the effectiveness and robustness of the proposed method. The paper is good, and the figures are clearly drawn.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has a certain novelty in applying the transformer as a new technique to solve incomplete-modality segmentation. It shows good knowledge of the background of multimodal brain tumor segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation is clear and the significance is strong both clinically and in pure research. However, it is difficult to know why the transformer was chosen. As a popular technique, the transformer is widely used in several medical image processing tasks. From the sentence “the dedicated Transformer for multimodal modeling of brain tumor segmentation has not been carefully tapped yet, letting alone the incomplete multimodal segmentation.”, it seems that the work was proposed in order to apply a new method to a specific task, rather than because the task needs the method in order to improve. Of course, it is easy to understand that one tries many methods and finds one that is good. But why it is good and why you chose it should be clearly presented in the submitted paper.
    2. The related work should be carefully summarized. Compared with brain tumor segmentation in general, incomplete-modality brain tumor segmentation is a specific field in which there are not that many papers. You may categorize and comment on methods following their main ideas and discuss the pros and cons class by class.
    3. Why are U-HeMIS and U-HVED chosen as your benchmarks? Is it because they are both latent-space-based models?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible if all hyperparameters were provided; however, they are not.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Basically speaking, I understand that you try a new way to solve the problem, although I don’t know why you use it. If solid reasons were provided, it would be nice. Besides, the validation is limited, maybe due to the paper length limitation. There are several algorithms on this topic; more comparisons would be more persuasive.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation to use the transformer should be clearly presented.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My main concern was why the transformer was chosen here, and the authors have stated it in more detail. No more concerns from me. However, the validation still needs to be improved if possible. I guess it’s limited by the length of the entire paper. So, it’s fine in its current version.



Review #4

  • Please describe the contribution of the paper

    This work is the first attempt to apply the Transformer to the incomplete multimodal brain tumor segmentation task.

    The proposed mmFormer not only outperforms the SOTA methods on the BraTS benchmark but also does well in the situation of missing modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work designs a hybrid modality-specific encoder for each modality to extract modality-specific features. The encoder uses a Transformer to build long-range dependencies after downsampling by the CNN.

    This work designs a modality-correlated encoder to fuse the features between different modalities. It is a great method of fusing features with the Transformer rather than simple feature fusion such as concatenation or an MLP.

    This work introduces auxiliary regularizers to further enhance the model’s robustness to incomplete modalities.

    The proposed method outperforms the previous SOTA methods in both complete and incomplete multimodal segmentation, and it gains obvious improvements, indicating the effectiveness of the proposed mmFormer.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a small mistake in the modality-correlated encoder in Figure 1: the Value should be multiplied by the relation matrix of the Query and Key.

    Also, the effectiveness of the Bernoulli indicator is not clear.
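For context, the Figure 1 comment above refers to standard scaled dot-product attention, in which the Value is weighted by the softmax-normalized relation matrix of the Query and Key. A minimal NumPy sketch of that generic mechanism (an illustration only, not the authors' implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: the Value is weighted by the
    softmax-normalized relation (attention) matrix of Query and Key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # relation matrix of Query and Key
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # Value weighted by the relation matrix
```

With zero queries and keys, the attention weights are uniform, so each output row is the mean of the Value rows.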

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The details of the proposed method are distinctly clarified. The training setup and the details of the used dataset are also described. This work claims that the code will be available after the blind review, which ensures the reproducibility of this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is interesting to visualize the details of the Transformer for IntraTrans or InterTrans to further investigate the effectiveness of the Transformer in brain tumor segmentation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method obtains obvious improvements with the Transformer in both complete and incomplete multimodal segmentation. Thus, we can see the potential of the Transformer not only in the brain tumor segmentation of this work but also in various tasks in medical imaging.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #5

  • Please describe the contribution of the paper

    In this paper, the authors propose mmFormer for segmenting tumor regions from brain data captured by multiple MRI modalities. The effectiveness of mmFormer is demonstrated through experiments evaluating its performance on the BraTS2018 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Refinement of features for each modality extracted by the U-Net encoder using Transformer.
    • Improvement of segmentation results generated by the U-Net decoder through weighted combination and refinement of features.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Low reproducibility due to lack of explanation of network architecture.
    • Lack of accuracy comparison with related work.
    • Insufficient explanation of experimental conditions.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Even if one understands the content of the paper, it is not possible to implement the method because the details of the network architecture are not explained. It is also not possible to reproduce the experiment because the detailed experimental conditions are not described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Multimodal segmentation methods using Transformer, such as TransBTS and Unetr, have been proposed. In contrast, can the authors claim to be the first to use Transformer for brain tumor segmentation? The structure of mmFormer shows that the features extracted by the U-Net encoder are refined by Transformer, weighted and combined, and a segmentation mask is generated from them by the U-Net decoder. Although there is a difference in whether feature extraction is performed on a modality-by-modality basis or by combining modalities, both conventional methods and mmFormer are a combination of U-Net and Transformer.

    2. Generally, features are refined by repeating the Transformer block multiple times, but mmFormer only goes through the Transformer block once. Why is mmFormer not repeated multiple times?

    3. In the modality-correlated encoder, there are parameters, i.e., the deltas. Are they parameters that are trained or scalar values that are pre-determined? If they are scalars, the values used in the experiments should be clearly presented in the paper.

    4. Do the denominators $g_i^c$ and $p_i^c$ in equation (8) indicate squares? If so, parentheses should be used appropriately for clarity. Why is the range of the sum different for the encoder and decoder in equation (9)? The range for the decoder is $i = 1, \ldots, l-1$. Since $M$ contains 4 elements, do you use the sum of 3 losses for the decoder? Since equation (8) uses $i$, it would be better to use $m$ instead of $i$ in equation (9).

    5. The description of the experimental conditions in this paper is insufficient. In general, it is necessary to divide the data into training, validation, and test sets, but no specific conditions are given. The specifications of the Conv Encoder and Conv Decoder of mmFormer are not clear.

    6. How does mmFormer perform when a single modality is used as input in Table 1? It is explained that the delta of an unused modality is set to zero. As shown in Figure 1, four modalities are assumed as input, but how do you handle the modalities that are not used? If one modality is the input, then mmFormer can be thought of as a method of refining U-Net intermediate data with the Transformer.

    7. Why is BraTS 2018 used in the experiments in this paper, but not BraTS 2019 or BraTS 2020? In TransBTS, accuracy was evaluated using BraTS 2019 and BraTS 2020. In order to demonstrate the effectiveness of mmFormer, it is necessary to compare its accuracy with that of Transformer-based methods. Why is the comparison with TransBTS mentioned in the text but not included in Table 1? UNETR can be compared to mmFormer because its implementation is publicly available.
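Regarding point 4 above: the question presumes the common soft Dice formulation with squared denominator terms, $2\sum_i g_i p_i / (\sum_i g_i^2 + \sum_i p_i^2)$. A minimal sketch of that variant (my reading of the notation, not the authors' exact loss code):

```python
import numpy as np

def soft_dice(g, p, eps=1e-6):
    """Soft Dice with squared denominator terms:
    2 * sum(g * p) / (sum(g**2) + sum(p**2))."""
    g = np.asarray(g, dtype=float).ravel()
    p = np.asarray(p, dtype=float).ravel()
    return (2.0 * (g * p).sum() + eps) / ((g ** 2).sum() + (p ** 2).sum() + eps)
```

For a binary ground truth and a perfect prediction the score is 1; for fully disjoint masks it approaches 0.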

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My decision is weak reject because there are many unclear points as mentioned in the detailed comments and the paper needs to be revised.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I have carefully checked the authors’ rebuttal. Some concerns have been addressed, but they do not explain why they only performed the evaluation on BraTS 2018. I assume there is some negative reason, because other MICCAI 2022 papers I have reviewed performed experiments using BraTS 2020 and BraTS 2021. Based on the above, my rating is still weak reject, as before.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This article proposes an architecture using the Transformer, named Multimodal Medical Transformer (mmFormer) for brain tumor segmentation with incomplete multimodal learning. There is a certain novelty in applying the Transformer, which is a new technique, to solve the problem of segmentation with incomplete modalities. To remove some ambiguities, the authors need to address some critical points in a rebuttal listed below:

    • The paper does not explain the motivation for using the transformer to deal with incomplete data.
    • The paper should give more details on the implementation to make reproducibility possible for readers, and on the experimental conditions.
    • It lacks an accuracy comparison with some recent related works, for example, a comparison with ACN in ref [21].
    • It is assumed that there are four input modalities. What are the inputs in case of missing modalities in the test procedure?
    • Why did the authors not use BraTS 2021, which is available?
    • It is necessary to better interpret the results in Table 3. The authors are also invited to make the corrections indicated by the reviewers.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We would like to acknowledge all the reviewers and ACs for their careful reviews and time. Overall, all reviewers recognize the novelty and effectiveness of mmFormer. Besides, most reviewers (R1, R2, R3, R4) agree on the good organization and clarity of this manuscript. Meanwhile, reviewers also highlight the clinical significance (R1, R2, R4) and solid motivation (R3, R4) of this work.

Our responses to main concerns are listed below:

  1. The reason for choosing the Transformer (R3): We have clarified the motivation to explore the Transformer for incomplete multimodal learning in the introduction (the last 7 lines of the 3rd paragraph on page 2). In incomplete multimodal learning of brain tumor segmentation, features extracted with limited receptive fields (e.g., by CNNs) tend to be biased when dealing with varying modalities. The Transformer is effective at modeling long-range dependencies and thus contributes to learning the modality-invariant representation.

  2. Implementation details and experimental conditions (R2, R5): The number of filters at each level of the 5-stage CNN encoder is 16, 32, 64, 128, and 256, respectively. The features from the CNN encoders of different modalities at a specific level are concatenated and forwarded as skip features to the CNN decoders. Meanwhile, the outputs of the CNN encoders are directly flattened to 1D sequences and fed to the intra-modal transformers. mmFormer has 106M parameters and 748G FLOPs. The model is trained for about 25 hours with 17 GB of memory on each GPU. We will clarify this information in the revision. The data split was obtained from the authors of ACN, and thus the comparison is valid.
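The flattening step described in this response can be illustrated with a small shape example; the channel count of 256 matches the fifth encoder stage mentioned above, while the 8×8×8 spatial size is purely hypothetical:

```python
import numpy as np

# Hypothetical bottleneck output of one modality-specific CNN encoder:
# (channels, depth, height, width) after five stages of downsampling.
features = np.random.rand(256, 8, 8, 8)

# Flatten the spatial dimensions into a 1D sequence of tokens,
# one 256-dimensional token per voxel.
tokens = features.reshape(256, -1).T
print(tokens.shape)  # (512, 256)
```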

  3. Comparison with previous works On HeMIS and U-HVED (R3): As mentioned in the introduction (2nd and 3rd paragraphs on page 2), existing methods for incomplete multimodal learning can be categorized into “image synthesis”, “knowledge distillation”, and “shared latent space”. Our mmFormer intends to learn a modality-invariant representation in the shared latent space. Therefore, we compare it with HeMIS and U-HVED (representative “shared latent space” methods).

On ACN (R2, R3): ACN belongs to the stream of “knowledge distillation”. In the case of N modalities, ACN has to train 2^N-2 times to distill 2^N-2 student models for all conditions of missing modalities, while our mmFormer only learns once by a unified model. Specifically, ACN is trained for 672 hours with 144M parameters (1 teacher & 14 students), while mmFormer requires only 25 hours with 106M parameters. Nevertheless, as mentioned by R2, the average performance of mmFormer (59.85, 72.97, 82.94) is still close to ACN (61.21, 77.62, 85.92). We will provide more discussion in the revision.

On TransBTS and UNETR (R5): We did not claim to be the first work to use the Transformer for multimodal segmentation, but rather the first to explore it for INCOMPLETE multimodal learning. None of the recent Transformer-based methods investigates this setting, which is of high clinical value.

  1. The input in case of missing modalities at test time (R5) Referring to Eq. (6), in case of missing modalities, the multimodal token of each missing modality is replaced by a zero vector. From the view of mmFormer, the number of input modalities is therefore always 4, with the tokens of missing modalities set to zero vectors.
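The zero-vector substitution can be sketched as below. This is a hedged illustration of the mechanism described above, not the paper's code; the function name, token shapes, and modality ordering are assumptions for the example.

```python
import numpy as np

# Illustrative token dimensions; the real values depend on the encoder.
SEQ_LEN, DIM = 512, 256
MODALITIES = ["t1", "t1ce", "t2", "flair"]

def build_multimodal_tokens(available):
    """Stack tokens for all 4 modalities; missing ones become zero vectors.

    `available` is the set of modality names present at test time.
    """
    tokens = []
    for m in MODALITIES:
        if m in available:
            # Stand-in for the intra-modal Transformer output of modality m.
            tokens.append(np.random.randn(SEQ_LEN, DIM))
        else:
            # Missing modality: its token is held by a zero vector (cf. Eq. (6)).
            tokens.append(np.zeros((SEQ_LEN, DIM)))
    return np.stack(tokens)

x = build_multimodal_tokens({"t1", "flair"})
print(x.shape)  # (4, 512, 256): input size is fixed regardless of availability
```

Because the input tensor always has the same shape, a single trained model handles every subset of available modalities without architectural changes.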

  2. Why use BraTS 2018 instead of the latest ones (R1, R5) BraTS 2018 is widely used in recent works [7] [10] [21] [22] on multimodal brain tumor segmentation, and thus we believe it is also a valid benchmark for our method. We will evaluate mmFormer on more datasets in future work.

  3. Table 3 (R2) In Table 3, both IntraTrans and InterTrans obtain non-trivial improvements on average over all combinations of available modalities. This shows that mmFormer learns a robust shared representation by building long-range dependencies. We sincerely appreciate the suggestions from R2. Due to limited space, we are unable to elaborate on the remaining issues, but we promise an in-depth revision regarding them.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This article proposes a Transformer-based architecture, named multimodal Medical Transformer (mmFormer), for brain tumor segmentation with incomplete multimodal learning. The network combines features extracted from each MRI sequence using a modality-correlated encoder, which forces the network to learn meaningful representations from each individual encoder. There is a certain novelty in applying the Transformer, a relatively new technique, to solve the problem of segmentation with incomplete modalities. The authors’ responses in the rebuttal help clarify critical points. Overall, my recommendation is therefore “acceptance”.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal covers and answers the main points raised by the reviewers and the meta-reviewer, and the paper presents an interesting and valid solution to an important problem.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    There is a common consensus that the idea of applying Transformers to multimodal segmentation with missing modalities is somewhat novel, with which I side. The authors addressed most of the concerns in their rebuttal. However, the major criticism, which the authors failed to answer in a convincing way, is the motivation for using BraTS 2018 as the evaluation benchmark. Despite this, I do not think the conveyed message would change significantly regardless of the dataset used.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


