
Authors

Matteo Ronchetti, Wolfgang Wein, Nassir Navab, Oliver Zettinig, Raphael Prevost

Abstract

Multimodal image registration is a challenging but essential step for numerous image-guided procedures. Most registration algorithms rely on the computation of complex, frequently non-differentiable similarity metrics to deal with the appearance discrepancy of anatomical structures between imaging modalities. Recent Machine Learning based approaches are limited to specific anatomy-modality combinations and do not generalize to new settings. We propose a generic framework for creating expressive cross-modal descriptors that enable fast deformable global registration.
We achieve this by approximating existing metrics with a dot-product in the feature space of a small convolutional neural network (CNN), which is inherently differentiable and can be trained without registered data. Our method is several orders of magnitude faster than local patch-based metrics and can be directly applied in clinical settings by replacing the similarity measure with the proposed one. Experiments on three different datasets demonstrate that our approach generalizes well beyond the training data, yielding a broad capture range even on unseen anatomies and modality pairs, without the need for specialized retraining. We make our training code and data publicly available.
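
For intuition, here is a minimal PyTorch sketch of the idea the abstract describes: a small CNN maps each image to a feature volume, and similarity is the dot product of the two feature maps. The architecture and sizes below are illustrative assumptions, not the paper's actual network.

    import torch
    import torch.nn as nn

    class SmallEncoder(nn.Module):
        """Tiny fully convolutional 3D feature extractor (illustrative)."""
        def __init__(self, channels=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, channels, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            )

        def forward(self, x):
            return self.net(x)

    def dot_product_similarity(feat_fixed, feat_moving):
        # Per-voxel dot product over channels, averaged over the volume.
        return (feat_fixed * feat_moving).sum(dim=1).mean()

    encoder = SmallEncoder()
    fixed = torch.randn(1, 1, 32, 32, 32)   # e.g. an MR patch
    moving = torch.randn(1, 1, 32, 32, 32)  # e.g. a US patch
    sim = dot_product_similarity(encoder(fixed), encoder(moving))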

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_72

SharedIt: https://rdcu.be/dnwxp

Link to the code repository

https://github.com/ImFusionGmbH/DISA-universal-multimodal-registration

Link to the dataset(s)

https://doi.org/10.5281/zenodo.583096


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a novel method to learn similarity metrics for multi-modal image registration. The key idea is to register the images by comparing features extracted from them by a small CNN (see Eq. (2)).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is simple and pragmatic. It is shown to be fast and to work well on the Learn2Reg 2021 challenge.

    The paper is well motivated.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some parts of the methodology are not clear to me.

    I find some results suspicious. Maybe the methods compared to the proposed one were not properly optimized.

    I will develop these points in my comments.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The methodology description is too broad to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It is not clear to me whether T_{\alpha} and \phi are trained simultaneously or not. Formalising the training procedure more clearly would be a plus for the paper.

    Why use the same CNN for F and M in Eq. (2)? The features to extract are not necessarily the same in the two multi-modal images (a sketch contrasting the two design options follows these comments).

    In Fig. 2, I would expect the results obtained using the proposed method to be slightly less accurate (but obtained far more quickly) than those obtained using LC^2. The proposed method indeed approximates this metric, if I understand the paper correctly. Can the authors explain this phenomenon?
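
    A minimal sketch contrasting the two design options raised above, assuming tiny illustrative encoders (the layer sizes are hypothetical, not the paper's architecture):

        import torch.nn as nn

        def make_encoder(channels=16):
            return nn.Sequential(
                nn.Conv3d(1, channels, 3, padding=1), nn.ReLU(),
                nn.Conv3d(channels, channels, 3, padding=1),
            )

        # Option 1: shared weights (siamese), as the reviewer reads the
        # paper -- the same features are extracted from both modalities.
        shared = make_encoder()
        siamese_pair = (shared, shared)

        # Option 2: one encoder per modality -- allows modality-specific
        # features, at the cost of more parameters.
        two_branch_pair = (make_encoder(), make_encoder())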

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This method could be interesting, but the paper deserves more discussions and a more clearly described methodology.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I thank the authors for their rebuttal, which convincingly addressed my main concerns about this paper. I therefore changed my recommendation about this paper. I hope that the authors will however develop the clarity and the discussions of the paper, if it is accepted.



Review #2

  • Please describe the contribution of the paper

    This paper presents a deep learning-based similarity metric for multi-modal image registration. A CNN is trained to encode images of different modalities into a common feature space; the dot product of two feature vectors is then used as a patch similarity that approximates the LC2 measure. The method offers a speed gain over conventional LC2 computation and can thereby improve registration accuracy through global optimization with more initializations.
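
    A minimal sketch of this training idea, assuming patch pairs with precomputed LC2 targets; sample_patch_pairs_with_lc2 is a hypothetical stand-in, not a function from the paper's code:

        import torch
        import torch.nn as nn

        encoder = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 16, 3, padding=1),
        )
        optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

        def sample_patch_pairs_with_lc2(batch_size=8, size=16):
            """Hypothetical stand-in: random patches and LC2 targets."""
            a = torch.randn(batch_size, 1, size, size, size)
            b = torch.randn(batch_size, 1, size, size, size)
            lc2 = torch.rand(batch_size)  # LC2 values lie in [0, 1]
            return a, b, lc2

        for step in range(100):
            a, b, lc2_target = sample_patch_pairs_with_lc2()
            # Dot product of spatially pooled features as the prediction.
            fa = encoder(a).mean(dim=(2, 3, 4))
            fb = encoder(b).mean(dim=(2, 3, 4))
            pred = (fa * fb).sum(dim=1)
            loss = nn.functional.mse_loss(pred, lc2_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()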

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed approach of approximating a similarity metric via a siamese network and a dot product is, as far as I know, novel in the field of registration. It is also straightforward and readily applicable to different modalities. The method has been evaluated on various modality pairs and anatomical sites.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no discussion on previous works that have tried to learn similarity metrics. E.g., Haskins, Grant, et al. “Learning deep similarity metric for 3D MR–TRUS image registration.” and more in Fu, Yabo, et al. “Deep learning in medical image registration: a review.” This makes it difficult to assess the novelty and weaknesses compared to previous methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    A part of the data used in the paper is publicly available (provided by others, not the authors), but both training and evaluation also involve proprietary datasets. No code is available. Architecture and training are described adequately.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    ● Some key related works in the area of metric learning and registration are missing in the introduction/discussion.
    ● Registration is unsupervised in nature, and ground-truth DVFs are not required in many unsupervised DIR methods. In terms of approximating a metric, the proposed method is supervised in the sense that it requires ground-truth LC2 values. The authors should clarify their statements accordingly.
    ● The method does not use any normalization layer, but what about image pre-processing? Does the image/patch intensity need to be rescaled?
    ● The Hausdorff distance is often reported at its 95th percentile to exclude outlier pixels, but I do not see the need or reason to do this for Dice. Is the percentile case-wise or label/organ-wise? A more reasonable choice might be to report the average DSC (and std) on different organs/structures separately (a small sketch of this follows these comments).
    ● The absence of LC2 registration in Sec. 4.2 should be explained.
    ● The authors claim that the method can be used for all types of images and does not need to be retrained for a new task, but the three testing modalities are all present in the training set; only a certain combination is unseen. Please rephrase the statement. It would be interesting to test the model’s effectiveness on CBCT/PET scans.
    ● It would also be interesting to compare the method against supervised end-to-end DIR in future work (i.e., LC2 similarity objective vs. LC2-derived DVF objective).
    ● Please check the grammar and wording again, e.g., “computationally expensive to compute”.
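
    A small numpy sketch of the per-organ reporting suggested above; label IDs and volumes are illustrative:

        import numpy as np

        def dice(pred, gt, label):
            p, g = pred == label, gt == label
            denom = p.sum() + g.sum()
            return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else np.nan

        pred = np.random.randint(0, 4, size=(64, 64, 64))
        gt = np.random.randint(0, 4, size=(64, 64, 64))
        per_label = {lbl: dice(pred, gt, lbl) for lbl in (1, 2, 3)}
        scores = list(per_label.values())
        print(per_label)
        print("mean +/- std:", np.nanmean(scores), np.nanstd(scores))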

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is interesting and the method is well thought through. However, discussions on some important previous works would strengthen the paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I still think the idea of learning a metric is interesting and novel. But the authors’ feedback on generalizability and sensitivity is not well supported by evidence, at least from the current submission. Therefore I do not think it is fair to consider this point as a strength/innovation over existing approaches.



Review #3

  • Please describe the contribution of the paper

    The paper introduces a DL-based variant of the LC^2 image similarity, defined as the dot product between multimodality descriptors extracted by a lightweight fully convolutional neural network, and demonstrates its suitability for global and local optimization in US-CT, US-MR, and CT-MR registration. The proposed metric was trained with unregistered data and demonstrated the ability to generalize to unseen data, yielding performance comparable to the classic LC^2 metric.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces an original approach to approximate the LC^2 similarity metric using a dot-product similarity between CNN-extracted features from the two images to be registered. Although the idea of estimating intermediate modality-invariant (scalar- or vector-valued) images from two images acquired with different imaging modalities is not new, a fast DL approximation of the LC^2 metric could benefit real-time multimodality registration. Additionally, the approach yielded performance comparable to LC^2 in US-MR and US-CT deformable registration.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The idea of converting multimodality images into a common (modality-invariant) space is not novel, since there is both learnable and non-learnable prior work [1,2]. Similarly, computing similarity based on the dot product is not new; consider, for example, normalized cross correlation [3] (a sketch of this equivalence follows the references below). In order for the work to be useful in most real clinical settings, estimated transformations should be able to map points defined in the original image spaces described by image origin, direction, and spacing as encoded in DICOM. This work, however, does not support such mappings.

    [1] Z. Jiang, et al. Modality-Invariant Representation for Infrared and Visible Image Registration. arXiv:2304.05646, 2023.
    [2] M. P. Heinrich, et al. MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis, 16(7):1423-1435, 2012.
    [3] A. A. Goshtasby. Similarity and Dissimilarity Measures. In: Image Registration. Advances in Computer Vision and Pattern Recognition. Springer, London, 2012.
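
    The point that normalized cross correlation [3] is itself a dot product can be made concrete: after zero-mean, unit-norm normalization of two patches, their NCC equals the dot product of the normalized intensity vectors.

        import numpy as np

        def normalize(patch):
            v = patch.ravel().astype(float)
            v -= v.mean()                      # zero mean
            return v / (np.linalg.norm(v) + 1e-12)  # unit norm

        a = np.random.rand(8, 8, 8)
        b = np.random.rand(8, 8, 8)
        ncc = normalize(a) @ normalize(b)  # dot product, lies in [-1, 1]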

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the work is low, since the description of the methodology, including the CNN architecture, loss function, preprocessing, and training data requirements, is unclear. For example, Eq. 2 and the CNN architecture described do not guarantee that a CNN feature has unit norm or that Eq. 2 ranges in [0,1], yet training patches were sampled based on similarity values ranging in [0,1]. For the proposed DISA-LC^2, it is unclear whether the weighting function is based on the local patch variance of the fixed image or of the moving image, and whether it has to be recomputed for each different pair of moving/fixed images. Also, the paper describes that the CNN was trained using unregistered data. Were the unregistered data acquired from the same patients (intra-subject pairs) or from different patients (inter-subject pairs)?
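
    One possible way to obtain the boundedness questioned above, stated as an assumption rather than as the paper's confirmed design: L2-normalizing the feature vectors bounds the dot product to [-1, 1] (cosine similarity), which an affine map takes to [0, 1].

        import torch
        import torch.nn.functional as F

        f_fixed = torch.randn(16)   # feature vector at one voxel (illustrative)
        f_moving = torch.randn(16)
        cos = torch.dot(F.normalize(f_fixed, dim=0),
                        F.normalize(f_moving, dim=0))  # in [-1, 1]
        sim_01 = 0.5 * (cos + 1.0)  # guaranteed to lie in [0, 1]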

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper proposes a DL-based variant of the LC^2 image similarity, defined as the dot product between multimodality descriptors extracted by a light fully convolutional neural network (CNN), and demonstrates its ability to yield performance comparable to the classic LC^2 metric in US-CT, US-MR, and CT-MR registration.

    Although formulating LC^2 as a learnable function is original, the methodology is not clearly described. For example, is Eq. 2 bounded below and above? If so, Eq. 2 and/or its description needs modification, since it appears unbounded. The explanation of the experiments and evaluations likewise needs to be improved. The following comments should be addressed to improve readability.
    - Since metrics do not necessarily reflect physically plausible registration outcomes, for each experiment the paper should include at least one figure showing images after registration.
    - Eq. 2 and the CNN architecture do not guarantee that a CNN feature has unit norm or that Eq. 2 ranges in [0,1], but patches were sampled based on similarity values ranging in [0,1].
    - It is unclear whether DISA-LC^2 was computed in a manner similar to LC^2, i.e., as the average over different radii. If so, how was that achieved? The paper should include a diagram of the DISA-LC^2 CNN architecture.
    - For DISA-LC^2, it is unclear whether the weighting function is based on the local patch variance of the fixed image or of the moving image, and whether it has to be recomputed for different moving/fixed images.
    - For a fair comparison, the paper should describe the parameter settings of each approach: MIND-SSC, LC^2, and DISA-LC^2.
    - For each experiment (especially those described in Sections 4.2 and 4.3), if rigid/affine registration was used to provide initialization, its registration performance should be reported alongside the outcome of the proposed method. For example, Table 2 should include the DSC and HD of the rigid/affine initialization, if used.
    - Since the main aim of the work is fast global optimization in multimodality registration, the paper should report the runtime of registration, including global multi-start (if used) and local gradient-based optimization, for each experiment (a schematic of such a pipeline follows these comments).
    - For the experiment described in Section 4.3, what was the ground truth and how was it defined? What were the 2 deformation parameters? Please describe the deformation as a function applied to a point, in an equation, to clarify the deformation model.
    - For Fig. 2, please put either the mean and standard deviation or the median and interquartile range next to each box plot, to help the reader understand the performance of each approach.
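
    A schematic of the multi-start pipeline whose runtime is requested above; similarity and apply_rigid are hypothetical placeholders, not the paper's implementation:

        import itertools
        import numpy as np

        def apply_rigid(feat_moving, params):
            """Hypothetical: resample a feature map under rigid parameters."""
            return feat_moving  # placeholder for an actual resampler

        def similarity(feat_fixed, feat_moving):
            return float((feat_fixed * feat_moving).mean())

        feat_fixed = np.random.rand(16, 32, 32, 32)   # computed once
        feat_moving = np.random.rand(16, 32, 32, 32)  # computed once

        angles = np.linspace(-30, 30, 5)  # degrees, illustrative grid
        starts = itertools.product(angles, angles, angles)
        best = max(starts, key=lambda p: similarity(
            feat_fixed, apply_rigid(feat_moving, p)))
        # ...followed by local gradient-based refinement from `best`.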

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology is not new: a CNN extracts image features and a cosine distance is computed between the features. The method description is not clear enough to reproduce the work. Since the work is about image registration, and quantitative evaluation is not sufficient to demonstrate the physical plausibility of the estimated transformation, figures of images after registration should be included in the paper.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision

    There are 2 equations in the manuscript which are the keys of the work. Reviewer #3 asked questions regarding their ranges and the weights used in the equations, but the feedback does not provide any answers or clarifications.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received a mixed review of positive and negative feedback. While all reviewers found the concept of learning similarity metrics for multi-modal image registration interesting, several major concerns were raised. These concerns encompassed the limited novelty of the proposed model in comparison to prior work and the absence of sufficiently convincing experimental results, due to missing details and unexpected or suspicious outcomes. Overall, the current manuscript is not ready for publication at MICCAI.




Author Feedback

In light of the reviews we have received, we acknowledge that we must sharpen the manuscript to highlight the novelty and the superiority of our work over existing approaches, and are confident that this will lead to a strong final version. We would like to emphasize that approximating a complex multi-modal similarity metric with a small neural network is indeed novel, and the insight that this works so well could have enormous implications for image registration in clinical routine.

First of all, to clarify our architecture and training procedure, we make both the code and the preprocessed data public already at this stage: github.com/miccai3033/3033 (will be moved after acceptance).

Reviewer #1 was concerned about the plausibility of our results, finding them “suspicious”. As this claim was not substantiated but did cast doubt on our scientific integrity, we would like to strongly repudiate it. We made significant efforts to tune the baseline approaches, and are confident in our implementations, since our results for MIND and LC2 in experiments 4.1 and 4.2 closely resemble the ones submitted by the respective authors to the Learn2Reg challenges. Furthermore, for experiment 4.3, we contacted the authors of LC2 to make sure that our implementation was correct. Generally speaking, there could be multiple reasons for DISA performing better than LC2: differentiability allows the use of a better derivative-based optimizer, and the inductive bias of the CNN is probably making the registration objective function smoother. While this will be discussed in the final version of this paper, a more thorough investigation is deferred to future work. In the meantime, we will include figures of the volumes after registration as additional material.
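
A sketch of what the derivative-based optimizer enables here, under stated assumptions: feature maps are computed once, the moving feature map is warped with a differentiable resampler, and gradients flow to the affine parameters. This is illustrative, not the authors' released pipeline.

    import torch
    import torch.nn.functional as F

    feat_fixed = torch.randn(1, 16, 24, 24, 24)   # precomputed once
    feat_moving = torch.randn(1, 16, 24, 24, 24)  # precomputed once

    # Identity affine [I | 0] as the starting point of refinement.
    theta = torch.eye(3, 4).unsqueeze(0).clone().requires_grad_(True)
    optimizer = torch.optim.Adam([theta], lr=1e-2)

    for step in range(50):
        grid = F.affine_grid(theta, feat_fixed.shape, align_corners=False)
        warped = F.grid_sample(feat_moving, grid, align_corners=False)
        loss = -(feat_fixed * warped).sum(dim=1).mean()  # maximize dot product
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()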

Our method does not try to approximate the registration as a whole, but rather only substitutes the similarity measure, independently of the transformation and/or deformation model used. This work can therefore be directly applied in any clinical registration setting where patch-based similarity measures (e.g., LC2, MI, NCC) or local descriptors (like MIND) are being used. We can already disclose that we have repeated the experiments from the paper, this time completely excluding ultrasound images from the training data. We obtained similar results, which shows that our method generalizes well across modalities and anatomies. Our approach is quite generic and not sensitive to choices of architecture or datasets. We are therefore confident it can have an impact on a large number of registration problems.

While local handcrafted descriptors, such as MIND, have already been used for multimodal registration, the idea of learning from a more complex similarity metric is novel. Furthermore, we show that DISA is clearly superior to MIND when dealing with ultrasound data. Other works learn a similarity metric using a CNN (for example Haskins et al. 2019, Sedghi et al. 2018). Unlike the existing literature, our method does not require ground-truth registrations for training, does not require evaluating the CNN at every optimizer iteration (and is therefore significantly faster), extracts local features that can be used as part of existing registration approaches (see experiment 4.2), and, most importantly, is not limited to a single anatomy+modality combination but generalizes far beyond its training data. The latter point differentiates our work from many existing papers, for instance Guorong Wu et al. 2017.

The above considerations will be addressed in the manuscript. In addition, we have expanded the references to existing works as suggested and added as much detail to the experiments as possible given the space constraints. We thank the reviewers for the constructive feedback and for the opportunity to clarify some misunderstandings and improve our paper.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The average score for this paper remains below the acceptance threshold after the authors’ rebuttal. While the authors’ response clarified questions mostly from R1 & R2, and the reviewers find the concept of learning similarity metrics for multi-modal image registration interesting, R3 highlights that the core idea of computing cosine distances between CNN-extracted features lacks novelty. Moreover, the current manuscript lacks sufficiently convincing results to demonstrate better generalizability and sensitivity compared to the state of the art, as also emphasized by R2.

    It is evident that the authors need to significantly improve the paper’s clarity and provide insightful discussions for readers to understand and follow the core methodological developments.

    In conclusion, the current manuscript is not ready for publication at MICCAI. This meta-reviewer recommends that the authors address all the questions raised by the reviewers in future submissions, while substantially improving the paper’s clarity and organization.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the reviewers have regressed to the mean, feeling borderline about the paper and slightly leaning towards accept.

    Overall, it seems that the major concerns were addressed, although the reviewers still emphasize the need to clean up the paper by the camera ready deadline. Reviewer 2 in particular has serious concerns about the explanation of the equations that drive the method.

    Overall, I agree that there is quite a bit of clarity that needs to be improved, and the paper is borderline. It may have just enough to merit discussion at the conference.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I have read the comments and rebuttal. This paper is about learning/approximating the LC^2 similarity metric for image registration. Parts of the raised concerns have been addressed, e.g., the reasons for DISA performing better than LC^2, the method implementation, and the method’s generalization. Publication of the code and preprocessed data will help improve the paper’s reproducibility.


