Authors

Gašper Podobnik, Primož Strojan, Primož Peterlin, Bulat Ibragimov, Tomaž Vrtovec

Abstract

Radiotherapy (RT) is a standard treatment modality for head and neck (HaN) cancer that requires accurate segmentation of target volumes and nearby healthy organs-at-risk (OARs) to optimize radiation dose distribution. However, computed tomography (CT) imaging has low image contrast for soft tissues, making accurate segmentation of soft tissue OARs challenging. Therefore, magnetic resonance (MR) imaging has been recommended to enhance the segmentation of soft tissue OARs in the HaN region. Based on our two empirical observations that deformable registration of CT and MR images of the same patient is inherently imperfect and that concatenating such images at the input layer of a deep learning network cannot optimally exploit the information provided by the MR modality, we propose a novel modality fusion module (MFM) that learns to spatially align MR-based feature maps before fusing them with CT-based feature maps. The proposed MFM can be easily implemented into any existing multimodal backbone network. Our implementation within the nnU-Net framework shows promising results on a dataset of CT and MR image pairs from the same patients. Furthermore, the evaluation on a clinically realistic scenario with the missing MR modality shows that MFM outperforms other state-of-the-art multimodal approaches.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_71

SharedIt: https://rdcu.be/dnwEo

Link to the code repository

N/A

Link to the dataset(s)

https://doi.org/10.5281/zenodo.7442914

https://www.imagenglab.com/newsite/pddca/

Reviews

Review #1

Please describe the contribution of the paper

The authors address CT/MR multi-modal segmentation of head and neck organs-at-risk. A novel modality fusion module (MFM) is proposed and employed within the nnU-Net framework. MFM spatially aligns features from the auxiliary modality (MR) to feature maps from the primary modality (CT) through spatial transformer networks (STN). The proposed methodology is evaluated on a private dataset of CT and MR images and a publicly-available dataset of CT scans acquired for radiotherapy planning purposes. Results show that MFM outperforms state-of-the-art multi-modal approaches on a clinically realistic scenario.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Innovative and well-motivated methodology
- Relevant use of STN for multi-modal feature map registration
- Well-conducted assessment using Dice and Hausdorff distance
- Encouraging segmentation results on a clinically realistic scenario
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- How to use the trained network for inference purposes in the missing modality scenario could be more deeply explained
- Qualitative results could be integrated to visually observe the quality of the obtained delineations
- Some references on cross-modality learning are missing
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
- Implementation details are disclosed in the paper. In particular, the architecture based on nnUNet can be easily reproduced.
- Apart from the PDDCA dataset, which is publicly available, neither the code nor the data will be disclosed.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The submission is well-written and of high interest for the medical image analysis community. The methodology is innovative and well-motivated. Comments provided below should be taken into account for further improvements.

Major comments:
- The motivation part of Sect.1 could mention works on cross-modality learning from the medical image analysis field (i.e. not only computer vision) such as Q. Dou et al. “Unpaired multi-modal segmentation via knowledge distillation,” IEEE Transactions on Medical Imaging, 2020, G. Andrade-Miranda et al. “Pure versus hybrid Transformers for multi-modal brain tumor segmentation: a comparative study” ICIP, 2022 and [21].
- The methodological part (Sect.2) should discuss the scenario of missing modality. In practice, you could explain how to use the trained network for inference purposes when a given modality (MR in your case) is missing.
- Among the baselines used for comparison purposes (Sect.2), you use a nnU-Net trained on concatenated CT and MR image pairs. Are the images registered? This should be stated explicitly.
- The submitted paper lacks qualitative results that would allow readers to visually assess the quality of the obtained delineations. These could at least be included in the supplementary materials.
Minor comments:
- The sentence “While it was demonstrated that complete spatial invariance cannot be achieved with STNs” in Sect.2 could be further detailed.
- Spatial transformer networks (STN) are used to align feature maps from the auxiliary modality (MR) to feature maps from the primary modality (CT). As you mentioned, STN was previously proposed in [3] to resample and align low-resolution feature maps to high-resolution feature maps. In relation to your work, you could cite and position your contribution with respect to Yan et al. “Longitudinal detection of diabetic retinopathy early severity grade changes using deep learning”, MICCAI OMIA, 2021, whose intermediate fusion scheme employs STN to align feature maps from different time points in a mono-modal and longitudinal setting.
- STN regresses 12 affine 3D transformation parameters to perform the registration task. Is a rigid alignment sufficient? An alternative could be to make use of strategies such as VoxelMorph to infer a non-rigid alignment between feature maps from the auxiliary and primary modalities.
- You state that MAML [19] has a considerably higher number of parameters. Please give the number of parameters of your pipeline and of MAML to allow an objective comparison.
- Before Sect.4, you could explicitly mention what “” and “” means (used in Fig.2 and 3) in the text, by providing the corresponding thresholds. Only “” and “**” are explicitly mentioned.
- Infinite values of HD95 were replaced with a maximal value over all data. One can wonder if it makes sense to do this. An alternative could be to take into account the largest diagonal in the image.
- The primary modality in your study is CT. Have you tried to use MR as primary modality (and hence CT as auxiliary modality)?
- Perspectives dealing with more than 2 imaging modalities could be mentioned in the conclusion part.
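
The two options raised in the HD95 comment above can be compared with a minimal sketch. It assumes that HD95 is reported as infinity when one of the two masks is empty; the function names, the image shape and the voxel spacing are hypothetical and are not taken from the paper.

```python
import math

def replace_inf_with_max(values):
    """Replace infinite HD95 values with the largest finite value in the list."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        raise ValueError("no finite HD95 value to fall back on")
    cap = max(finite)
    return [v if math.isfinite(v) else cap for v in values]

def image_diagonal_mm(shape, spacing):
    """Length of the image diagonal in millimetres (voxel count times spacing per axis)."""
    return math.sqrt(sum((n * s) ** 2 for n, s in zip(shape, spacing)))

def replace_inf_with_diagonal(values, shape, spacing):
    """Replace infinite HD95 values with the image diagonal, an upper bound on any distance."""
    diag = image_diagonal_mm(shape, spacing)
    return [v if math.isfinite(v) else diag for v in values]

# Hypothetical HD95 values (mm) for three cases; the second has an empty prediction.
hd95 = [3.2, math.inf, 7.5]
print(replace_inf_with_max(hd95))
print(replace_inf_with_diagonal(hd95, shape=(100, 100, 50), spacing=(1.0, 1.0, 2.0)))
```

The first variant makes the penalty depend on the other cases in the dataset, whereas the second ties it to the image geometry, which is the point of the reviewer's suggestion.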
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Innovative and well-motivated methodology
- Strong evaluation and encouraging results on a clinically realistic scenario
- Well-written paper, easy to follow
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

7
[Post rebuttal] Please justify your decision

Responses from authors are convincing. My overall opinion will remain the same as in the review period.

Review #2

Please describe the contribution of the paper

This study aimed to integrate the MR modality into the CT-based segmentation pipeline for head and neck organs-at-risk segmentation. The studied problem is of great interest to the MICCAI community and is also clinically relevant. While the conventional approach is to feed the multimodal image data as multi-channel inputs into segmentation networks, this study proposed a fusion strategy to address the geometrical misalignment between the two modalities that results from erroneous image registration. The proposed method is designed to deal with a missing modality. The performance of this method was evaluated on two small datasets and compared against relevant references.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written, the problem is well described, and a very good overview of the current literature and its limitations is provided. The methodological aspects of this work are well considered, and the authors conducted a fair comparison against relevant works.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The proposed method of this study was formed based on the assumption that including the MR in the CT-based H&N OARs segmentation pipeline would improve the accuracy of soft tissue delineations. Despite the efforts put into developing several modules to deal with the misalignment between the modalities, the quantitative results lie either in the same range as the CT-alone model or even were outperformed by the CT-alone model. In addition, more elaborations and justification about the functionality of the developed modules are required. Moreover, the performance of the model was evaluated on two small-size datasets (56+15 subjects).
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Considering the limited word count of the manuscript, the provided descriptions are satisfactory to replicate the model given the fact that the model was constructed within nnUNet framework. Sufficient details about the results quantification were provided.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

• The proposed model modifies the encoder-decoder segmentation network (nnU-Net in this study) by introducing three components: a localization network, a grid generator, and a sampler. More elaboration should be provided regarding the functionality of these modules. Specifically, details regarding the optimization of the regressor network are needed. It should be explained how the 12 outputs of the regressor are associated with the parameters of a 3D affine transformation, how their errors are quantified, and how the model parameters are optimized. The same level of detail is needed for the grid generator component. In the current version of the manuscript, it is not clear how the grid is generated from the output of the regressor. More importantly, how can the authors be sure that the output of the resampler unit is geometrically aligned with the feature maps from the other encoder? In general, regardless of the final segmentation accuracy, how do the authors justify the functionality of the proposed alignment strategy?
• Comparing the segmentation accuracy of the proposed model against the conventional single-modality CT model, it can be seen that adding the MR modality through the proposed method improved the segmentation accuracy from 0.761 to 0.767 (mean Dice score). In addition, the single-modality model based on CT volumes performed much better than the proposed method when dealing with a missing image modality (0.747 vs 0.678). Such a quantitative comparison raises questions regarding the efficacy of the proposed method. In other words, including a second encoder along with 3 relatively complex components does not confirm the underlying hypothesis of the study (that including the MR images improves segmentation accuracy, in particular for soft tissues). How do the authors justify these observations?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper does not provide enough evidence to justify how the proposed fusion module functions. Readers are not convinced of how well the fusion module aligns the feature maps. The reported numerical results show that conventional CT-based segmentation still performs either in the same range as or better than the proposed multimodal model.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

The most critical comment by this reviewer in the first round of review was that, despite the development of a rather complex solution to address the misalignment between CT and MR modalities, the quantitative results do not show a real advantage of the proposed method. As the studied datasets were quite small, it is really hard to attribute the observed slight segmentation improvement in certain cases to the developed method. This fundamental problem was not addressed in the authors' rebuttal letter.

Review #3

Please describe the contribution of the paper

The authors present a multimodal (CT+MRI) approach for the segmentation of organs-at-risk in the head and neck area. The main contribution lies in a novel modality fusion module, which combines feature maps of different modalities while simultaneously spatially aligning them. The evaluation shows the method to be on par with existing multimodal approaches when all input modalities are available and to be significantly more stable when the second modality (MRI) is missing.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The development of the paper, mainly the description of the field as well as the presentation of the method itself is really well done.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The presentation of the evaluation is hard to follow; especially in printed form, all the figures are far too small. It would be easier to follow the discussion if simple box plots or tables had been used instead of the violin plots. Figure 3 (left) in particular is almost meaningless. The authors claim that the information within the MRI scans helps to outline soft tissue organs; however, only averaged results are presented. To get better insight into the presented approach, it would be helpful to see individual organ evaluation results.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The dataset used is private; nonetheless, an evaluation on a public dataset is performed. The authors state in their reproducibility response that they will provide access to the code, which would be helpful for obtaining the parametrization of the nnU-Net used for training, as it is dataset dependent. With the paper alone and without the code, it would be hard to reproduce the exact results.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The presented multimodal approach targets an important practical task that is mostly not considered in current HaN OAR segmentation research. Although the presentation of the evaluation can be improved, the authors present, with their modality fusion module, an interesting and novel solution for the task.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The contribution and the practical relevance of the research outweigh its weaknesses.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The key contribution of the work is to combine MRI information to improve segmentation on CT scans. This is an interesting concept and could help to improve the segmentation accuracy. The key idea is to use different patient scans (MRI and CT are unpaired) to access the information from MRI for use in CT, using a spatial transformer network that aligns the images so that information from the different modalities can be leveraged. As the reviewers point out, there are a few key concerns. First, none of the results clearly show how the information from MRI helps to improve CT accuracy, for example how the features extracted from CT are improved by the information from MRI. The methodology could be better explained, as details are lacking on how the information from the two modalities is fused. Some baseline experiments are shown. However, the authors may have overlooked other developments in this area, especially those using knowledge or cross-modality distillation learning to solve the same problem for thoracic cancers (Jiang et al., who used unpaired cross-modality distillation learning, published in IEEE TMI 2021 (or 2022), as well as Dou et al. for cardiac and abdomen images, Med Image Analysis 2021). The authors might want to look at some of these works to clarify how their approach improves over others. Also, the fundamental premise of bringing different modality images into alignment to help with segmentation presupposes, first, that the anatomy of the patients is similar, which is not the case for cancer patients who may have very large tumors and enlarged lymph nodes that can push normal anatomy away, and second, that registered images are sufficient, whereas multiple works have shown that image-level information is less useful than intermediate- or high-level information for cross-modality distillation. Fundamentally, MRI, which captures tissue magnetization, is very different from CT, which quantifies tissue attenuation.
So, why would combining aligned images help with improved accuracy? If so, why not do alignment first and then directly apply into the nnUnet?

To address in rebuttal: Please provide a clear intuition for why the proposed approach using registration of MRI and CT images would work to improve accuracy on CT? Also, please clarify the results to show how the approach improves accuracy. The method details and discussion of the contribution with respect to several prior works in this area are missing. Please clarify how the work improves over others and what gap it addresses that others have not addressed. There are also concerns regarding the fundamental differences in the anatomy of different patient scans - how would the approach overcome that? Please discuss and clarify limitations of the approach. Finally, once the network is trained, does one need to supply CT and MRI images? This should be discussed as well.

Author Feedback

We would like to thank the reviewers (R1, R2, R3) and area chair (AR) for raising relevant questions and providing valuable feedback. We intend to revise our paper by especially focusing on:

[R1, R2, R3, AR] We recognize the ambiguity in our explanation of the preprocessing pipeline for input images. Our private dataset comprises 56 CT and MR image pairs obtained from the same patients. Prior to input into our model, each CT and MR image pair undergoes deformable registration using the SimpleElastix tool. However, despite reasonably successful registration, inherent misalignments of soft tissue organs persist due to low contrast, artifacts and noise. These observations led us to develop the Modality Fusion Module (MFM), which learns to align feature maps rather than images, effectively addressing potential registration errors and enhancing final segmentation. We refer to this alignment as “pseudo-registration” as it is not explicitly supervised but allows the network to adaptively transform feature maps for improved segmentations. Although not shown in the aggregated graphs, when MR modality is present, the proposed model improves segmentation of soft tissues, such as cervical esophagus, brainstem, lacrimal glands and lips by 2.9, 1.5, 2.4 and 1.8 DSC percentage points, respectively. The results for the missing modality scenario are significant from the perspective of comparing multimodal models. Unlike other multimodal baselines that invariably fail for most organs (except for mandible), our model demonstrates reasonable performance. This showcases that even though we have not used any augmentations to simulate missing modality scenario during training, our model is by design inherently more robust to such anomalies. We hypothesize that this is because the MFM module views the MR modality as an auxiliary one, aligning MR feature maps to CT feature maps while disregarding meaningless MR feature maps.
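
As an illustration for readers (not the authors' code), the Dice similarity coefficient (DSC) and a difference expressed in DSC percentage points can be computed as in the minimal sketch below. Masks are assumed to be flat lists of 0/1 labels for a single organ, returning 1.0 for two empty masks is a convention chosen for this sketch, and the function names are hypothetical. The two mean scores in the last line are the ones quoted in Review #2.

```python
def dice(pred, ref):
    """Dice similarity coefficient between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * r for p, r in zip(pred, ref))
    total = sum(pred) + sum(ref)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total

def percentage_points(dsc_baseline, dsc_proposed):
    """Difference between two DSC values, expressed in percentage points."""
    return (dsc_proposed - dsc_baseline) * 100.0

reference = [1, 1, 1, 1, 0, 0, 0, 0]
prediction = [1, 1, 1, 0, 0, 0, 0, 0]
print(dice(prediction, reference))
# Mean DSC of the CT-only model versus the proposed model, as quoted in Review #2.
print(percentage_points(0.761, 0.767))
```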

[R1, R2, AR] We thank reviewers for providing relevant references on (unpaired) cross-modality learning. As our focus is paired multimodal image segmentation, we believe that using image-to-image synthesis (Jiang et al., 2022) would add unnecessary complexity to our model. Admittedly, it would be interesting to compare our approach with the one proposed by Dou et al. 2020.

[R1] We appreciate the emphasis on the need for further clarification on the inference process. In case one modality is missing, we substitute an empty matrix (i.e. a matrix of zeros) for the missing image. While alternative strategies exist, such as inserting a matrix of mean values computed over all training set images or using synthetic images (e.g. generated with CycleGANs), we contend that inputting an all-zeros matrix is a rational approach that does not degrade the feature maps derived from the other modality.
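
The zero-substitution strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: volumes are represented as nested Python lists, the substituted MR volume is assumed to have the same shape as the CT volume, and `zeros_like` and `prepare_inputs` are hypothetical helper names.

```python
def zeros_like(volume):
    """Return a nested list of zeros with the same shape as `volume`."""
    if isinstance(volume, list):
        return [zeros_like(v) for v in volume]
    return 0.0

def prepare_inputs(ct, mr=None):
    """Return the (CT, MR) pair fed to the two encoders.

    When the MR image is missing, an all-zeros volume of the CT shape is substituted.
    """
    if mr is None:
        mr = zeros_like(ct)
    return ct, mr

# Hypothetical 1 x 2 x 2 CT volume with no MR image available.
ct_volume = [[[1.0, 2.0], [3.0, 4.0]]]
ct_in, mr_in = prepare_inputs(ct_volume)
print(mr_in)
```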

[R1, R3] If permitted, we intend to provide examples of qualitative results and per organ boxplots in supplementary materials.

[R2] We acknowledge the need for elaboration on the functionality of modules used in MFM. The Localization network comprises four Conv & ReLU layers, followed by a two-layer dense network, which regresses 12 scalars per training sample, corresponding to 3x4 parameters found in the affine spatial transformation matrix. Before training, we initialize the final layer of the localization network to produce an identity transformation matrix, aiding in model convergence. The Grid generator creates a sampling grid based on the affine transformation matrix and the Sampler is a differentiable block that interpolates input FMs based on the sampling grid and produces a new, aligned FM. Both blocks are implemented in the PyTorch library. They are differentiable and do not require any proprietary optimization, i.e. their parameters are updated by the gradients calculated from the segmentation loss.
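
To make the roles of the grid generator and the sampler concrete, the following standard-library sketch shows how 12 regressed scalars form a 3x4 affine matrix, how a sampling grid is derived from it, and how a feature map is resampled. It is an illustration only and makes several assumptions that are not in the rebuttal: coordinates are normalized to [-1, 1] with corner alignment, the sampler uses nearest-neighbour lookup instead of a differentiable interpolation, the localization network is omitted, and all function names are hypothetical. In PyTorch, the corresponding operations are provided by `torch.nn.functional.affine_grid` and `torch.nn.functional.grid_sample`.

```python
def identity_theta():
    """12 affine parameters of the 3D identity transform (row-major 3x4)."""
    return [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0]

def to_matrix(theta):
    """Reshape 12 regressed scalars into a 3x4 affine matrix [A | t]."""
    if len(theta) != 12:
        raise ValueError("expected 12 affine parameters")
    return [theta[0:4], theta[4:8], theta[8:12]]

def normalized_coords(n):
    """n coordinates evenly spaced in [-1, 1], end points included."""
    return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def affine_grid(theta, shape):
    """Grid generator: source (x, y, z) coordinate for every output voxel."""
    matrix = to_matrix(theta)
    depth, height, width = shape
    grid = []
    for z in normalized_coords(depth):
        for y in normalized_coords(height):
            for x in normalized_coords(width):
                point = (x, y, z, 1.0)  # homogeneous output coordinate
                grid.append(tuple(sum(a * b for a, b in zip(row, point))
                                  for row in matrix))
    return grid

def sample_nearest(volume, grid, shape):
    """Sampler: nearest-neighbour lookup; coordinates outside the volume give 0.0."""
    depth, height, width = shape

    def to_index(coord, n):
        return round((coord + 1.0) * (n - 1) / 2.0) if n > 1 else 0

    out = []
    for x, y, z in grid:
        i, j, k = to_index(z, depth), to_index(y, height), to_index(x, width)
        inside = 0 <= i < depth and 0 <= j < height and 0 <= k < width
        out.append(volume[i][j][k] if inside else 0.0)
    return out

# Hypothetical 2 x 2 x 2 feature map, indexed as volume[z][y][x].
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
shape = (2, 2, 2)

# Identity initialization leaves the feature map unchanged.
print(sample_nearest(feature_map, affine_grid(identity_theta(), shape), shape))

# A translation along x by the full normalized extent shifts the content by one voxel.
shifted = identity_theta()
shifted[3] = 2.0
print(sample_nearest(feature_map, affine_grid(shifted, shape), shape))
```

With the identity parameters the resampled map equals the input, which is why initializing the final layer of the localization network to the identity gives the fusion a neutral starting point.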

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors responded to the major concerns of the reviewers. Assuming the authors will update the paper’s discussion to clarify the differences of their approach from others and improve their explanation of the method as done in the rebuttal, the paper is recommended for acceptance.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work proposes a multi-modality fusion strategy to encode CT and MR image data for segmenting head and neck tumours for radiotherapy. It involves a spatial transformer network to potentially bring image features from both modalities into alignment. While interesting as an idea, my main concern is the underwhelming evaluation results in the comparison with the CT-only segmentation method, which is the baseline. The authors' rebuttal does not convince me why the added use of MR with segmentation masks is beneficial. Although it is mentioned that adding MR improves segmentation of other soft tissue structures, the clinical application of interest that was presented is radiotherapy, so tumour segmentation has to be as accurate as possible. There seems to be no improvement when introducing MRI. The fact that the proposed method is robust to missing MRI data is nice; however, it is also unnecessary to have MRI data if CT alone already gives a (presumably, since clinical relevance of the achieved absolute performance is not discussed) reasonable result.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This work proposes an interesting idea of leveraging multimodal data (MRI and CT) to improve CT segmentation in radiotherapy. My major concern with this work is the lack of meaningful improvement from adding the MRI modality, especially on the CT-only dataset, and the lack of explanation of how MRI helps to improve CT segmentation.


Multimodal CT and MR Segmentation of Head and Neck Organs-at-Risk