
Authors

Luyi Han, Tianyu Zhang, Yunzhi Huang, Haoran Dou, Xin Wang, Yuan Gao, Chunyao Lu, Tao Tan, Ritse Mann

Abstract

Multi-sequence MRI is valuable in clinical settings for reliable diagnosis and treatment prognosis, but some sequences may be unusable or missing for various reasons. To address this issue, MRI synthesis is a potential solution. Recent deep learning-based methods have achieved good performance in combining multiple available sequences for missing sequence synthesis. Despite their success, these methods lack the ability to quantify the contributions of different input sequences and estimate region-specific quality in generated images, making it hard to be practical. Hence, we propose an explainable task-specific synthesis network, which adapts weights automatically for specific sequence generation tasks and provides interpretability and reliability from two sides: (1) visualize and quantify the contribution of each input sequence in the fusion stage by a trainable task-specific weighted average module; (2) highlight the area the network tried to refine during synthesizing by a task-specific attention module. We conduct experiments on the BraTS2021 dataset of 1251 subjects, and results on arbitrary sequence synthesis indicate that the proposed method achieves better performance than the state-of-the-art methods. Our code is available at https://github.com/fiy2W/mri_seq2seq.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_5

SharedIt: https://rdcu.be/dnwjf

Link to the code repository

https://github.com/fiy2W/mri_seq2seq

Link to the dataset(s)

http://braintumorsegmentation.org/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a supervised image-to-image translation (I2I) approach for multi-sequence MR image synthesis. The proposed method is based on Seq2Seq, with several modifications and improvements, including the use of interpretable weights in anatomy fusion and a task-specific attention module for refined results. These contributions are interesting and novel in the field. The experiments conducted on the BraTS dataset show superior performance over existing I2I methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The use of interpretable weights in information fusion is novel and interesting. This provides extra information about the model behavior, which is very important in clinical settings.
    2. The experiments and comparisons are relatively well-designed and informative. The experiments are conducted on a public dataset and use multiple evaluation metrics, providing a comprehensive evaluation of the proposed framework.
    3. The concept of introducing pixel-level uncertainties/refinements to the method through the task-specific enhanced map (TSEM) is interesting. This provides extra interpretability and transparency for potential applications in clinical settings. Furthermore, the fact that the TSEM highlights the tumor region is promising. While I like the general concept of a pixel-level uncertainty map, I think the current TSEM has issues to address, which I discuss in the weaknesses and constructive feedback sections.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the concept of introducing a pixel-level uncertainty map is interesting, the authors’ approach for calculating the task-specific enhanced map (TSEM) is not well-motivated. The task-specific weighted average (TSWA) module is designed to provide optimal features for synthesis, yet the authors then use the task-specific attention (TSA) module to refine these features, which is based on the same input as the TSWA. This is counterintuitive and lacks motivation. Moreover, the results show only a slight improvement (within less than half of a standard deviation) after introducing the TSA module, and it is unclear whether it would be more effective to use only the TSA module for the task. The authors need to address these issues to clarify the design choices and their impact on the proposed method’s performance.
    2. The paper lacks statistical comparisons, which are necessary to support the conclusions drawn from the numerical results. Although the proposed method appears to perform better than the baseline methods in terms of mean values, the differences are not obvious after considering standard deviations. Statistical tests would provide a more convincing evaluation of the proposed method’s performance.
    3. The proposed method lacks clear demonstration of its clinical applicability. The paper is intended to address the problem of missing sequences in multi-sequence MRIs, but the authors did not explicitly investigate how missing data during training could affect its performance. Although the authors mentioned the use of zero-filled placeholders to handle arbitrary input sequence combinations, it is unclear how well this mechanism can simulate real missing data (for instance 50% of T1-Gd are missing). Evaluating the method’s ability to handle missing data during training is crucial for clinical applicability, as missing data can occur in both training and testing phases. The authors should address this issue to ensure the proposed method’s robustness and practicality.
    4. The writing needs a little improvement for conciseness and clarity. Several key pieces of information in the methodology require further elaboration. Although the paper is based on the arXiv paper Seq2Seq, it should be self-contained according to MICCAI review guidelines, as arXiv papers should not be considered as prior works. Specifically, the authors need to provide more information about 1) how and why the encoder E can reduce the distance between different sequences at the feature level to facilitate more stable fusion, which is a crucial point for the entire work, and 2) provide more details about the HyperConv module.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is good. Public datasets are used in development and evaluation. Code will be publicly available (upon acceptance). Both mean and standard deviation are reported in comparisons.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The authors should investigate the proposed method’s capability to handle missing data not only during testing but also during training. Specifically, they could explore the impact of different percentages of missing MR sequences during training and evaluate how this affects the method’s performance in handling missing sequences during testing. This investigation will provide a more comprehensive understanding of the proposed method’s robustness and practicality in real-world clinical settings.
    2. While the proposed method’s performance on BraTS data is promising, it is important to evaluate its generalizability to other datasets. Therefore, the authors should consider exploring the proposed method’s performance on other public datasets and investigate the possibility of training the model on multiple datasets. This could help evaluate the proposed method’s capability to handle different sources of variation in multi-sequence MRIs, and would also improve its clinical applicability. For example, if one sequence is completely missing in one dataset but available in another, multi-dataset training could help the model handle more severe cases of missing data, thus improving applicability.
    3. The authors should provide more details on the motivation for introducing the TSA module and its necessity in the proposed method. It is unclear why the TSWA module alone cannot achieve the desired refinement, and the authors should clarify this point. Additionally, an ablation study is recommended to evaluate the performance of the TSA module alone, without the TSWA module. I recommend the authors should think about other approaches to provide such pixel-level uncertainty or provide stronger motivations for introducing both TSA and TSWA.
    4. The use of TSWA is very intriguing and interesting. Several recent papers [1,2] have also explored using weighted sum in feature space to achieve interpretability and robustness. This is an emerging research aspect that needs to be extensively explored in the future.
    5. It is recommended that the authors train their models end-to-end to further improve the efficiency and effectiveness of the proposed method. This would eliminate the need for multiple stages of training and potentially lead to better performance.
    6. The authors should explore the potential of the proposed method in other tasks, such as segmentation and classification. Further evaluations on downstream tasks would provide more comprehensive insights into the proposed method’s applicability and effectiveness.
    7. The authors should highlight the key differences between these methods in Fig. 2.

    [1] Liu et al. “One Model to Synthesize Them All: Multi-contrast Multi-scale Transformer for Missing Data Imputation.” TMI 2023. [2] Zuo et al. “HACA3: A Unified Approach for Multi-site MR Image Harmonization.” arXiv preprint 2022.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the proposed method has some interesting contributions in the use of interpretable weights and task-specific attention for image synthesis, there are several weaknesses that need to be addressed, including the need for further explanation of the task-specific attention module, and the absence of statistical tests. Additionally, the paper did not explicitly evaluate the proposed method’s ability to handle missing training data, which is a significant limitation in clinical settings. These weaknesses slightly outweigh the strengths of the paper, and I believe addressing these issues will significantly improve the quality and clinical impact of the proposed method.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Based on the authors’ responses to the reviewers’ concerns, I’m pleased to recommend accepting the paper. The authors have addressed my major concerns by conducting statistical tests to demonstrate the significance of the performance gain. Additionally, the authors in their rebuttal provided insights into the motivation and usage of the task-specific attention (TSA) module. These responses have strengthened the paper’s contribution and addressed key issues raised during the review process. However, it is important to acknowledge that limitations exist in any research work, and the concern regarding the handling of missing/unbalanced data remains.

    For their journal submission, I strongly encourage the authors to thoroughly investigate and validate the proposed method’s capability to handle missing data, as well as to consider and address the additional constructive feedback I provided. Overall, the paper meets the standards of MICCAI and demonstrates the potential to evolve into a substantial journal submission.



Review #3

  • Please describe the contribution of the paper

    In this study, the authors propose a TSF-Seq2Seq model for synthesizing missing MRI modalities. The work features model explanation, especially the analysis of input-sequence contributions and the visualization of the task-specific enhanced map. The authors used the public BraTS2021 dataset, with more than 1000 patients, to develop this model; the results show that the proposed model outperforms other state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem addressed by this paper is clinically relevant. Multi-parametric MRI data is in high demand for improving model performance because of the complementary information it provides. However, in clinical practice, many modalities are missing, which greatly reduces the available training data. Developing a missing-modality synthesis technique helps make use of the MRI information from existing modalities. One of the strengths of the paper is the model explainability, which helps analyze the contributing input sequences and the task-specific enhanced map. Although this is not a completely explainable model, it is good that the authors made an effort to make the results interpretable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall, this paper is well organized, with novelty in model explainability. I only have some minor comments: (1) Does the model consider the data imbalance issue? In clinical practice, most patients (for example, 90%) may be missing T1-weighted MRI while few patients (10%) are missing T2-weighted MRI; can the model still perform well in this situation? (2) Fig. 2 illustrates an example result for synthesizing T2-weighted MRI, which is a relatively simple task. Could the authors show some results for synthetic T1Gd? Synthesizing T1Gd is a more challenging task because, unlike routine sequences such as T1-weighted and T2-weighted MRI, an additional contrast agent is applied during scanning, leading to richer information in T1Gd. I would like to know whether the model can still perform well on this more challenging task. (3) Please keep the terms consistent, for example: Fig. 1/Figure 1, Fig. 2/Figure 2. (4) Typo: “column” in Fig. 3 of the supplementary material.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Limited reproducibility. This is an interesting work. It would be better if the authors could make corresponding codes public to make a better contribution to the community.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In future work, it would be better if the authors could take model generalizability into account in this task. Because MRI is not a quantitative imaging modality and MRI datasets are heterogeneous across institutions, a DL-based model may suffer a significant performance drop when tested on an external dataset.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    clinical value of utilizing missing MRI sequences and efforts to improve the model explainability

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors’ response did not adequately address my concerns.



Review #4

  • Please describe the contribution of the paper

    The authors propose an explainable synthesis method, which adapts weights for specific MRI modality synthesis tasks and provides some interpretability from two aspects: (1) visualize the contribution of each input sequence in the fusion stage by a trainable task-specific weighted average module; (2) highlight the area the network tries to refine during synthesizing by a task-specific attention module. The backbone is based on a recent seq2seq model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is built upon a recent seq2seq model for image synthesis, and it addresses two significant problems in multi-modal MR image synthesis:

    1. Fusing the input modalities at the feature level and visualizing the contribution of each input modality.
    2. Producing a voxel-level attention map from the task-specific attention module to visualize some kind of uncertainty.

    One point I would like to argue from the results: the comparison with existing CNN-based models shows that seq2seq model might be a better alternative.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses are the explanation of task-specific enhanced map and some experimental setting details. In Section 2.2., the authors say ‘As f_A is a contextual refinement for fused features, analyzing it can help us understand more what the network tried to do.’ This is correct but not in-depth. I would like to brainstorm a bit here:

    1. Would there be any differences between two attention maps derived from two mappings, ‘T1+T1-c–>T2’ and ‘T1+FLAIR+T1-c–>T2’, in the same subject? This might help us understand more of the properties of this kind of attention map.

    2. Is there any relation between attention map and uncertainty map? One simple baseline of uncertainty map would be the voxel-wise variation from multiple networks (same architecture but trained with different initialization). Thus, it would be great to see a baseline of uncertainty map here and see if the attention map and uncertainty map highlight the same region. Some discussion about uncertainty [1] could be helpful to highlight the contribution of this work.
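The ensemble baseline described in point 2 can be sketched as follows. This is only an illustration under stated assumptions: each network's output is represented as a flattened list of voxel intensities, the function name `voxelwise_variance` is hypothetical, and the numbers are made up rather than taken from the paper.

```python
from statistics import pvariance


def voxelwise_variance(predictions):
    """Return the per-voxel variance across ensemble members.

    predictions: list of flattened images, one per network, each a list of
    floats of equal length (an assumed representation, not the paper's).
    """
    n_voxels = len(predictions[0])
    if any(len(p) != n_voxels for p in predictions):
        raise ValueError("all predictions must have the same number of voxels")
    # For every voxel index, collect the values predicted by each network
    # and compute their population variance.
    return [pvariance([p[i] for p in predictions]) for i in range(n_voxels)]


# Made-up outputs of three networks (same architecture, different seeds)
# for a 3-voxel image.
preds = [
    [0.10, 0.50, 0.90],
    [0.10, 0.55, 0.70],
    [0.10, 0.45, 0.80],
]
uncertainty = voxelwise_variance(preds)
```

Voxels on which the networks agree get a variance near zero, while voxels on which they disagree get larger values. Such a map could then be compared region by region with the attention map.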

    Experimental setting:

    1. The training details such as model architecture and patch size seem to rely on the seq2seq paper. I would suggest the authors provide more details to make it self-contained.

    References: [1] Uncertainty-Aware and Lesion-Specific Image Synthesis in Multiple Sclerosis Magnetic Resonance Imaging: A Multicentric Validation Study. Frontiers in Neuroscience 2022

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Okayish, but more details need to be presented.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I like the motivation and the problem being addressed for image synthesis in this paper. Please carefully address the weakness part above by showing a bit more experiments and discussion on its interpretability which is the main contribution of this paper.

    In summary, there are three main points to be addressed (if it is possible):

    1. Would there be any differences between two attention maps derived from two mappings, ‘T1+T1-c–>T2’ and ‘T1+FLAIR+T1-c–>T2’, in the same subject? This might help us understand more of the properties of this kind of attention map.

    2. Is there any relation between attention map and uncertainty map? One simple baseline of uncertainty map would be the voxel-wise variation from multiple networks (same architecture but trained with different initialization). Thus, it would be great to see a baseline of uncertainty map here and see if the attention map and uncertainty map highlight the same region. Some discussion about uncertainty [1] could be helpful to highlight the contribution of this work.

    3. The training details such as model architecture and patch size seem to rely on the seq2seq paper. I would suggest the authors provide more details to make it self-contained.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    good motivation and contribution to image synthesis. some more experiments and clarification might be needed to better understand the results and text.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I keep my previous rating of ‘weak accept’. Generally a good paper. The authors seem to focus on addressing R1’s concerns, which is OK.
    But I would like to ask the authors to try to address the points I raised in a later version.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes an explainable approach to solve the multi-to-one MR synthesis problem. However, R1 raised several questions that need to be addressed in the rebuttal stage. Additionally, I have a question regarding the motivation behind this paper. Specifically, why would introducing attention or task-specific weights improve so-called “explainability”? Furthermore, how could such an improvement potentially benefit image synthesis in a real clinical setting?




Author Feedback

We thank all the reviewers (R) for reviewing and recognizing our work. We answered the questions raised by reviewers and provided more required information.

Q1: Explainability. (MetaR1, R1) A1: Our model tries to explain: (1) how much each input sequence contributes to the model, and (2) which region of the generated image the model is more uncertain about. Task-Specific Weighted Average (TSWA) can show sequence contributions, which may help optimize the clinical scanning protocol by skipping low-contribution scans. Task-Specific Attention (TSA) can help calculate the Task-Specific Enhanced Map (TSEM), highlighting the uncertain area. It may help downstream models/physicians to be cautious about less-certain regions, increasing model safety and guiding more accurate image diagnosis.

Q2: Motivation of TSA. (MetaR1, R1) A2: The motivation for TSA stems from the trade-off between the limitation of TSWA and its explainability. TSWA provides a linear combination of input features to measure the sequence contribution, which lacks the best weight for every pixel, leading to a slight performance drop when fusing multiple sequences. As shown in Table 1, TSF-Seq2Seq (w/o f_A) causes a slight decrease in PSNR and SSIM compared with Seq2Seq (Average) when the number of inputs is 2 or 3. Note that TSF-Seq2Seq (w/o f_A) refers to Seq2Seq+TSWA, and Seq2Seq (Average) indicates the average of the multiple outputs of Seq2Seq. Of course, it is appreciated that TSWA brings us the contribution of each sequence and a significant improvement in perceptual similarity (LPIPS). Thus, with ample motivation, we proposed TSA for refining the features fused by TSWA. After using TSA, the performance on all metrics has improved (concerns about statistics are explained in A3), and we can highlight uncertain areas by calculating TSEM, which is estimated without data and model uncertainty.

Q2.1: TSWA and TSA take the same features as input. A2.1: Based on A2, we aim to enable TSA to find the information that TSWA has lost and further refine it. Extracting lost information from incomplete information is not a good idea. Thus we connect them in parallel instead of in series.

Q2.2: TSA only. A2.2: TSA-only achieved worse results than our proposed method. We didn’t put it in the results because it deviates from the original intention of this paper, which is to propose an explainable framework, and does not fit the motivation of our method. With TSA only, we cannot get either sequence contribution weights or TSEM, which are two key contributions of this paper. Moreover, TSA-only makes the training more difficult: TSF-Seq2Seq can, ideally, inherit the performance of Seq2Seq by using zero initialization for TSWA and TSA, but TSA-only cannot.
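To make the parallel arrangement and the zero-initialization argument concrete, here is a minimal sketch. It is not the authors' implementation: it assumes that the sequence weights come from a softmax over per-sequence learnable scores (`logits`) and that the attention branch contributes an additive `residual`. The function name `fuse` is also hypothetical.

```python
import math


def fuse(features, available, logits, residual=None):
    """Weighted average of per-sequence feature vectors plus a parallel residual.

    features:  list of feature vectors, one per input sequence
    available: list of bools marking which sequences are present
    logits:    per-sequence learnable scores (assumed parameterisation)
    residual:  optional output of a parallel attention branch (assumed additive)
    """
    idx = [i for i, present in enumerate(available) if present]
    if not idx:
        raise ValueError("at least one input sequence must be available")
    # Softmax restricted to the available sequences; missing ones get weight 0.
    m = max(logits[i] for i in idx)
    exps = {i: math.exp(logits[i] - m) for i in idx}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(features))]
    dim = len(features[0])
    fused = [sum(weights[i] * features[i][d] for i in idx) for d in range(dim)]
    if residual is not None:
        fused = [f + r for f, r in zip(fused, residual)]
    return fused, weights


# Four made-up feature vectors; the third sequence is missing.
feats = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 6.0]]
avail = [True, True, False, True]
# Zero-initialised scores and a zero residual give a plain mean of the
# available features.
fused, weights = fuse(feats, avail, logits=[0.0] * 4, residual=[0.0, 0.0])
```

Under these assumptions, the zero-initialized model starts from a uniform average over the available sequences, and training can then move the weights and the residual away from that starting point. The learned `weights` are what would be read off as sequence contributions.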

Q3: Statistical comparisons. (R1) A3: TSF-Seq2Seq and the baseline differ significantly in PSNR (p=0.004), SSIM (p<0.001), and LPIPS (p<0.001) according to a paired-sample t-test. In the revision, we will indicate the p-value for all the comparisons.
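For readers unfamiliar with the test, a paired-sample t-test compares two methods on the same subjects by testing whether the mean of the per-subject differences is zero. The sketch below computes only the t statistic and degrees of freedom. The PSNR values are invented for illustration and are not results from the paper. The p-value would then be read from a Student t distribution with n-1 degrees of freedom, for example with `scipy.stats.ttest_rel` if SciPy is available.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Invented per-subject PSNR values for two methods on the same five subjects.
proposed = [27.1, 26.4, 28.0, 25.9, 27.3]
baseline = [26.8, 26.1, 27.9, 25.5, 27.0]
t_stat, dof = paired_t_statistic(proposed, baseline)
```

Because the test works on per-subject differences, it can detect a consistent small gain even when the across-subject standard deviations of the two methods overlap, which is the situation Review #1 describes.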

Q4: Missing/unbalanced data, external evaluation. (R1, R3) A4: The model has considered any input combinations, and additional adversarial learning can be used to solve data problems. We understand the importance of external evaluation and the impact of data constitution, and we’ll be exploring these topics more in-depth in our journal work.

Q5: Potential in other tasks. (R1) A5: We have demonstrated the potential of the proposed method on segmentation tasks in the submitted supplementary material. In our journal work, we will explore more applications for segmentation and classification tasks.

Q6: More details. (R1, R4) A6: The revision will add more details about Seq2Seq and the training settings.

Q7: Visualization. (R3, R4) A7: We thank the reviewers for the kind suggestions. Due to space limitations, we cannot provide more visualizations of synthesized images and TSEM in the revision, but we will explore them in depth in our journal work.




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors introduce a novel approach for synthesizing MRI modalities, aiming to enhance interpretability. Their contribution is twofold: (1) enabling quantification of the contribution of each input sequence in the fusion stage using a trainable task-specific weighted average module, and (2) highlighting the specific area that the network aims to refine during the synthesis process through a task-specific attention module. All three reviewers are satisfied with the current status of the manuscript.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper is interesting, clinically relevant, and well written. The authors addressed the reviewers’ major concerns with statistical tests and clarifications. They also provided the required information about the motivation of the task-specific attention.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes an explainable task-specific synthesis network via a task-specific code for missing-sequence MRI synthesis. The concerns on explainability raised by the reviewers have been well resolved. However, I do not think the motivation concern raised by the AC was resolved. I would expect a deeper discussion of the missing modalities and of the limits of the model for each potential missing target modality. I do not think the task-specific module is the major contribution of this work; it is similar to the idea of increasing imaging-modality variation for domain generalization, as in Wenguang Yuan et al. 2020 MedIA. Nevertheless, I think the work is OK to accept.


