
Authors

Zecheng Liu, Jia Wei, Rui Li, Jianlong Zhou

Abstract

People perceive the world with different senses, such as sight, hearing, smell, and touch. Processing and fusing information from multiple modalities enables Artificial Intelligence to understand the world around us more easily. However, when there are missing modalities, the number of available modalities differs across situations, which leads to an N-to-One fusion problem. To solve this problem, we propose a self-attention based fusion block called SFusion. Different from preset formulations or convolution based methods, the proposed block automatically learns to fuse the available modalities without synthesizing or zero-padding the missing ones. Specifically, the feature representations extracted from the upstream processing model are projected as tokens and fed into a self-attention module to generate latent multimodal correlations. Then, a modal attention mechanism is introduced to build a shared representation, which can be applied by the downstream decision model. The proposed SFusion can be easily integrated into existing multimodal analysis networks. In this work, we apply SFusion to different backbone networks for human activity recognition and brain tumor segmentation tasks. Extensive experimental results show that the SFusion block achieves better performance than the competing fusion strategies. Our code is available at https://github.com/scut-cszcl/SFusion.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_15

SharedIt: https://rdcu.be/dnwxX

Link to the code repository

https://github.com/scut-cszcl/SFusion

Link to the dataset(s)

https://www.med.upenn.edu/cbica/brats2020/


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper the authors propose a self-attention based fusion strategy that takes K input modalities and outputs a single fused feature representation. The method is compared to baselines on two tasks: multimodal brain tumour segmentation and activity recognition from sensor data. The method, called SFusion, contains two separate modules. The correlation extraction module takes the input modalities f and flattens them into vectors of size B x C x T, where T = Rf x K, Rf is the product of the spatial dimensions, and K is the number of modalities. These are then passed through self-attention layers and reshaped back to the original dimensions to give a transformed representation f’. The output of this module is passed to the modal attention module, which takes the transformed representations f’ and computes a per-modality, per-voxel weight map m’. This weight map is used to build the shared representation fs, which is passed through a decoder.
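    The mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustrative re-implementation of the reviewer's summary, not the authors' code: the projection weights are random (untrained), the batch dimension B is dropped, and the channel-mean scoring used to produce the modal weight map m' is an assumption made here for simplicity.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def sfusion_sketch(features, rng):
        """Fuse a variable-length list of K modality feature maps, each of
        shape (C, R), into one shared representation fs of shape (C, R)."""
        K = len(features)
        C, R = features[0].shape
        # Correlation extraction: concatenate modality tokens -> (C, T) with
        # T = R * K, then scaled dot-product self-attention over the T tokens.
        tokens = np.concatenate(features, axis=1)            # (C, T)
        Wq = rng.standard_normal((C, C)) / np.sqrt(C)
        Wk = rng.standard_normal((C, C)) / np.sqrt(C)
        Wv = rng.standard_normal((C, C)) / np.sqrt(C)
        q, k, v = Wq @ tokens, Wk @ tokens, Wv @ tokens      # each (C, T)
        attn = softmax((q.T @ k) / np.sqrt(C), axis=-1)      # (T, T)
        f_prime = np.split(v @ attn.T, K, axis=1)            # K maps f', each (C, R)
        # Modal attention: per-modality, per-position weights m' (softmax over
        # the K modalities), then a weighted sum builds the shared fs.
        scores = np.stack([fp.mean(axis=0) for fp in f_prime])  # (K, R)
        m = softmax(scores, axis=0)                          # sums to 1 over K
        fs = sum(m[i] * f_prime[i] for i in range(K))        # (C, R)
        return fs
    ```

    Because nothing in the block depends on a fixed K, the same function fuses 2, 3, or 4 available modalities, which is the point of the N-to-One formulation.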

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces a well-motivated fusion mechanism which allows for multi-modal fusion.
    • The method is tested on brain tumour segmentation but is actually agnostic to the spatial dimensions and is shown to work on 1D data.
    • Evaluation is strong with a good attempt at taking the most comparable work ([5]) and re-implementing it, then replacing their fusion layer (GFF) with Sfusion and finding that Sfusion works better.
    • Ablation study is provided to provide justification for the design choices of Sfusion.
    • Wilcoxon test used to measure statistically significant improvements.
    • Helpful table comparing to other methods on the same BraTS benchmark.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • For the purposes of this conference it is not hugely relevant that the fusion module is shown to work on this locomotion task - perhaps a task in robotics control in relation to computer assisted intervention would make it a better MICCAI paper. Equally future work could explore the behaviour of this fusion mechanism for multi-modal 2D images.
    • The methods section particularly 2.3 could use some improvement - detailed feedback in section 9.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Outstanding reproducibility as they provided code and used open-source datasets for their work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Section 2.3 could have some equations to show how fs is calculated from f’ and m
    • Figure 3 is nice but I believe if it was written out it would be easier to follow.
    • I have not done a sufficient deep dive into the literature but I feel that this statement “However, no work has explored the effectiveness of self-attention mechanism on the N-to-One fusion problem.” is likely wrong and would either remove it or amend it.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a fantastic MICCAI paper - method is well-justified and explained. Experiments are well chosen. Baselines are well-studied and re-implemented. Every attempt is made to make sure that the method is comparable to the baselines, keeping parts of the architecture that are not being compared the same. Code is provided, which was useful to the reviewer during his review.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a self-attention-based fusion block to solve the N-to-One fusion problem when dealing with missing modalities. Unlike existing fusion strategies, the proposed method automatically learns to fuse available modalities without synthesizing or zero-padding missing ones. The method is applied to different backbone networks for human activity recognition and brain tumor segmentation tasks. Experimental results demonstrate that SFusion achieves better performance than competing fusion strategies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed self-attention-based fusion block SFusion is a novel approach of handling the N-to-One fusion problem caused by missing modalities. It automatically learns the latent correlations between different modalities and builds a shared representation adaptively. 2) The method is data-dependent and does not impute missing modalities, which is beneficial to avoid extra computation and potential bias. 3) The proposed method is not limited to specific deep learning architectures, and it can be easily integrated into existing multimodal analysis networks. 4) The experimental results demonstrate the superior performance of SFusion over competing fusion strategies.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Although the authors provide extensive experiments on two different tasks, human activity recognition and brain tumor segmentation, it is unclear how the method could be generalised to other types of data. 2) It seems that this is not the first self-attention n-to-one fusion method, as the authors claimed in the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility. The authors provide the model code with instructions for running it.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) Although the work was validated on a medical dataset (BraTS2020), the clinical relevance of the proposed method is not evident.

    2) While the paper proposes the SFusion block as a self-attention-based fusion strategy, it fails to demonstrate its strengths in comparison to cross-attention methods. The necessity and advantages of using self-attention rather than cross-attention are not demonstrated.
    3) The paper states, “no work has explored the effectiveness of self-attention mechanism on the N-to-One fusion problem.” It seems that other works have attempted this direction, such as [1] Hu Zhu, Ze Wang, Yu Shi, Yingying Hua, Guoxia Xu, Lizhen Deng, “Multimodal Fusion Method Based on Self-Attention Mechanism”, Wireless Communications and Mobile Computing, vol. 2020, Article ID 8843186, 8 pages, 2020. https://doi.org/10.1155/2020/8843186
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the proposed method is interesting and the experimental results are promising, there are some limitations, including the lack of comparison with state-of-the-art methods and the potential limited generalizability to other types of data. The paper could benefit from addressing these limitations.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    This paper has scientific merit but has weaknesses in its presentation of experimental design and clinical value.

    Regarding A2: although this model is not specifically designed for medical imaging and could generalise to other data types, the reviewer is not fully convinced of its generalizability within medical imaging, which is desirable future work.

    Regarding A3: SFusion contains implicit cross-attention, which raises further doubts about whether its strong performance is leveraging cross-attention or “the effectiveness of self-attention mechanism”.

    Previous papers have compared cross-attention and self-attention methods, e.g., in [1]. The reviewer supposes the authors could make comparisons with cross-attention methods if the N is set to a fixed value, N=2 or N=3.

    [1] Rajan, Vandana, Alessio Brutti, and Andrea Cavallaro. “Is cross-attention preferable to self-attention for multi-modal emotion recognition?.” ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.



Review #3

  • Please describe the contribution of the paper

    This work provides a self-attention-based SFusion block for addressing the missing modalities in the multi-modal tasks. Based on the self-attention fusion block, the SFusion can automatically learn to fuse available modalities without padding or synthesizing information about missing modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper uses self-attention to fuse multi-modality information while ignoring the missing modalities, and achieves good performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of comparisons with other methods addressing missing modalities, such as the methods listed in Fig. 1(a)-(c).
    2. The paper should further discuss whether the improved performance comes from the self-attention-based block fusing multi-modality features or from ignoring the missing modalities; a comparison experiment could clarify this.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is adequate with open code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The comparison experiments do not reflect the motivation of the paper that SFusion outperforms the other strategies for addressing missing modalities, such as the methods listed in Fig. 1(a)-(c). It is not clear whether the improved performance comes from fusing multi-modality features with self-attention or from ignoring the missing modalities.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work provides a self-attention-based SFusion block for addressing missing modalities in multi-modal tasks. The experimental results cannot effectively explain the advantages of SFusion compared with the padding, selection, and convolution-based strategies. The good performance may come from the attention-based multi-modality fusion rather than from the handling of missing modalities.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors dealt with the comments well, but still need to revise the manuscript to address the remaining concerns in the final version.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a self-attention-based fusion block to solve the N-to-One fusion problem when dealing with missing modalities. The proposed method automatically learns to fuse available modalities without synthesizing or zero-padding missing ones. In general, the proposed method is interesting. However, there are some issues as follows: 1) This statement “However, no work has explored the effectiveness of self-attention mechanism on the N-to-One fusion problem” is not reasonable; 2) It is unclear how the method could be generalized to other types of data, as the clinical relevance of the proposed method is not evident; 3) The necessity and advantages of using self-attention rather than cross-attention are not demonstrated; 4) The compared experiments do not reflect the motivation of the paper that SFusion outperforms the other strategies addressing missing modalities.




Author Feedback

We sincerely thank the Meta-reviewer and all the Reviewers (R#1, R#2, R#3) for the thoughtful feedback. Below, we first answer the general concerns and then respond to the other points.

General concerns:

Q1 (R#1, R#2, Meta-review). The statement “However, no work has explored the effectiveness of self-attention mechanism on the N-to-One fusion problem” is not reasonable. A1: We acknowledge that self-attention has been studied in the multi-modal fusion problem. However, existing studies focused on multi-modal fusion with a fixed N; therefore, individual models need to be trained for different missing-modality cases. In contrast, our work focuses on the N-to-One fusion scenario, where N is variable during training rather than fixed. This allows the construction of a unified model capable of handling varying numbers of modalities. To avoid any misunderstanding, we will add this detailed explanation in the revision.

Q2 (R#1, R#2, Meta-review). Method generalizability to different data types is unclear, as clinical relevance remains unclear. A2: Due to space constraints, we conducted two tasks to validate the effectiveness of SFusion. The brain tumor segmentation task was chosen to evaluate SFusion’s performance in medical image processing. The human activity recognition task was selected to assess SFusion’s generalizability and its effectiveness in the non-medical domain. In future work, we will explore the behaviour of SFusion in other clinically relevant tasks.

Q3 (R#2, Meta-review). The necessity and advantages of using self-attention rather than cross-attention are not demonstrated. A3: Actually, the features of different modalities are concatenated in SFusion before self-attention, so there is implicit cross-attention during the self-attention process. In addition, explicit cross-attention is typically performed between two or three modalities. However, when more modalities are processed and some of them are missing, determining the rules for cross-attention becomes challenging. Currently, we have not found a suitable cross-attention method for comparison.
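The "implicit cross-attention" claim in A3 can be checked numerically: once tokens from two modalities are concatenated into a single sequence, the self-attention matrix contains off-diagonal blocks in which one modality's queries attend to the other modality's keys. A small NumPy sketch (the shapes and the plain dot-product scoring are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
R, C = 4, 8                                   # tokens per modality, channel dim
a = rng.standard_normal((R, C))               # modality A tokens
b = rng.standard_normal((R, C))               # modality B tokens
x = np.concatenate([a, b], axis=0)            # (2R, C) joint token sequence

logits = (x @ x.T) / np.sqrt(C)               # scaled dot-product scores
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn = e / e.sum(axis=-1, keepdims=True)      # (2R, 2R), rows sum to 1

# Off-diagonal block: A's queries attending to B's keys,
# i.e. cross-attention arising inside plain self-attention.
cross_AB = attn[:R, R:]
```

Since softmax weights are strictly positive, every entry of `cross_AB` is nonzero, so cross-modal interaction is always present in the fused sequence.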

Q4 (R#3, Meta-review). The compared experiments do not reflect the motivation of the paper that SFusion outperforms the other strategies addressing missing modalities, such as the listed methods in Fig.1(a)-(c). A4: In the human activity recognition task, we compared SFusion with the selection strategy (Fig.1(b)) used by EmbraceNet [9]. In the brain tumor segmentation task, we compared SFusion with GFF [5], which belongs to the convolution strategy (Fig.1(c)). Due to space limitations, we did not directly compare SFusion with the mean strategy shown in Fig.1(a). However, in the ablation experiments of reference [5], GFF outperforms the mean strategy. Therefore, we believe SFusion can surpass the mean strategy. These experiments can demonstrate that SFusion outperforms the other fusion strategies of Fig.1(a)-(c).
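For reference, the Fig. 1(a) mean strategy that the rebuttal argues is dominated by GFF reduces to an elementwise average over whichever modalities are present. A hypothetical sketch (not code from the paper or from [5]):

```python
import numpy as np

def mean_fusion(features):
    """Fig. 1(a) mean strategy: average the available modality feature
    maps, so N-to-One fusion is a parameter-free elementwise mean."""
    return np.mean(np.stack(features, axis=0), axis=0)
```

Like SFusion, this handles a variable number of inputs without zero-padding, but it weights every available modality equally regardless of content, which is the behaviour the learned modal attention is meant to improve on.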

The other concerns:

R#1: Q1. Section 2.3 could include equations for calculating fs from f’ and m. Figure 3 is nice but I believe if it was written out it would be easier to follow. A1: Due to space constraints, we removed the equations. We will strive to add them and provide more descriptions in the revision.

R#3: Q1. It is not clear whether the improved performance comes from fusing multi-modalities with self-attention or from ignoring the missing modalities. A1: The ablation experiments on both tasks show that self-attention improves performance. The mean strategy performs worse than GFF [5], indicating that simply ignoring missing modalities is not sufficient to improve performance. The convolution fusion modules handle missing modalities through zero-padding but require reconstruction and retraining when the number of modalities varies. SFusion does not require such reconstruction because it is a unified model.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the feedback of the authors and the combined comments of the reviewers, we have decided to accept this paper. The authors should also revise the paper according to the raised comments in the final version.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes an attention-based approach to deal with missing modalities in medical image analysis. Being able to learn and predict with missing modalities is a relevant and challenging problem in many scenarios. The authors show competitive results on a missing-modality setup using the BraTS data set. In their rebuttal, the authors were asked to clarify some statements and experiments, which they did well. The authors have addressed the comments of the reviewers, and the most critical reviewer has raised the paper’s rating from 3 to 5, indicating that the authors have addressed the comments well.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a self-attention based fusion mechanism to address the challenge of missing modalities in multimodal learning tasks. Overall this is a good MICCAI paper and reviewers generally agree on accepting it albeit with different degrees of enthusiasm. The author feedback sufficiently addresses the feedback on novelty in relation to other works studying self-attention in multimodal fusion and the query on cross-attention.

    I think the paper addresses an important problem with clear novel contributions and recommend acceptance. I suggest that the authors include a supplement showing a comparison with the mean strategy (Fig. 1(a)) and also a discussion of possible medical imaging use cases that can leverage the proposed method.


