
Authors

Kangrong Xu, Qirui Huang, Xuan Yang

Abstract

The variational registration model takes advantage of explaining uncertainties of registration results. However, most existing variational registration models are based on convolutional neural networks (CNNs), which cannot capture distant information in images. Besides, the evidence lower bound (ELBO) and the commonly used standard prior cannot close the gap between the real posterior and the variational posterior in the vanilla variational registration model. This paper proposes a network in a variational image registration model for cardiac motion estimation to effectively capture the spatial correspondence of long-distance images and solve the shortcomings of CNNs. Our proposed network comprises a Transformer with a T2T module and the cross attention between the moving and the fixed images. To close the gap between the real posterior and the variational posterior, the importance-weighted evidence lower bound (iwELBO) is introduced into the variational registration model with an implicit prior. The coefficients of a parametric transformation using multi-supports CSRBFs are latent variables in our variational registration model, which improve registration accuracy significantly. Experimental results show that the proposed method outperforms state-of-arts research on public cardiac datasets.
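The importance-weighted bound named in the abstract can be illustrated on a toy problem. The sketch below is not the authors' registration objective: it assumes a one-dimensional Gaussian model and a deliberately mismatched proposal q, both chosen here purely for illustration, and uses K=5 samples, the value the authors report in their rebuttal. It compares the single-sample average (the usual ELBO estimate) with the log of the averaged importance weights (the iwELBO estimate) computed from the same samples.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along `axis`."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def elbo_and_iwelbo(log_w):
    """log_w[k] = log p(x, z_k) - log q(z_k | x) for K samples z_k ~ q(z | x).

    ELBO estimate:   mean_k log w_k
    iwELBO estimate: log mean_k w_k
    """
    return np.mean(log_w, axis=0), log_mean_exp(log_w, axis=0)

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (v - mean) ** 2 / var)

# Toy model (an assumption for illustration): p(z) = N(0, 1), p(x | z) = N(z, 1).
# Mismatched proposal (also an assumption): q(z | x) = N(0.5, 1).
rng = np.random.default_rng(0)
x, K = 1.5, 5
z = rng.normal(0.5, 1.0, size=K)
log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, 0.5, 1.0)

elbo, iwelbo = elbo_and_iwelbo(log_w)
print(f"ELBO estimate: {elbo:.4f}  iwELBO estimate (K={K}): {iwelbo:.4f}")
```

By Jensen's inequality the log of the mean weight is never smaller than the mean of the log weights for the same samples, which is the sense in which the importance-weighted bound is tighter.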

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_55

SharedIt: https://rdcu.be/dnww8

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a transformer based registration network that aims to overcome challenges of long-range deformations, and further combines importance-weighted ELBO and aggregated posterior to close the gap between real and variational posterior. It enforces a sparse regularization constraint via the coefficients of multi compact support radial basis functions (CSRBF).

    This is applied to 4 public cardiac MR datasets for registration end systole/diastole, evaluated using segmentation metrics and compared against several learning based registration methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key innovation is in using a framework that involves Transformers encoders and decoders, iwELBO and CSRBFs.

    Results, on public data, indicate a good if not superior performance to the state of the art.

    A nice comparison/ablation study of the influence of different parts on one of the public datasets (ACDC) is provided.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Major:

    The paper starts off mixing classic variational registration and learning-based variational registration in the background section. The statement “most existing variational registration models are based on convolutional neural networks” is technically not true, as “classic” variational registration has been around for decades.

    I could not find any reference to the work by de Vos on deep learning-based registration using B-splines (MedIA 2019), which, due to its parameterization and multi-scale/level set-up, could be quite a relevant competing method.

    On the motivation of using transformers, I would not call end-diastolic/end-systolic spatial deformation particularly “long distance”, especially in a multi-resolution setting, which has not been explored or compared to here. However, in this setting, transformers operate on image patches, which makes this problem artificially long-range; this could perhaps be better explained.

    The results lack significance testing. The results replacing components with ViTs are less conclusive. It cannot be readily deduced from Table 3, that “cross-attention outperforms self-attention”.

    Minor:

    Please revise references which are inconsistently capitalized.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Use of public datasets. While numbers of training/test splits are reported, given these are public datasets it would be helpful to refer to the actual cases.

    Results are quantitatively reported, with mean/stdev, and the metric is referred to, but there is, contrary to what the authors state in their checklist, no “An analysis of statistical significance of reported differences in performance between methods.”

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This work shows promising results for a new transformer-based method, which, however, are not statistically analysed. The approach itself is not well motivated, e.g. why use patches and make this method artificially long-range when the application itself could be solved on whole images? The methods it is compared against perform at the same level, and sometimes better, so it is not clear what advantage this method would offer over others. The comparison methods are not representative of the relevant state of the art - e.g. VoxelMorph has evolved and provides much better versions since [4]; other RBF-based methods exist (de Vos MedIA 2017 as mentioned). Having said that, some of the ideas developed are quite interesting; they just have not been placed into appropriate context.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors leading me to score this as a weak accept are the validation, which lacks significance testing, and the lack of comparison against more similar methods (e.g. de Vos for RBF DL-based registration, or other variational methods, classic or DL-based). However, the work presents some interesting new methodological ideas that may be of interest to the community.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    While we all had some minor issues with this work, we all consistently chose “weak accept” so I really don’t understand why this work was not an early accept.

    I strongly disagree with not following the MICCAI review process (https://conferences.miccai.org/2023/en/THE-MICCAI-REVIEW-PROCESS.html) which clearly states:

    Stage 8: Early Paper Decisions and Rebuttal Process The Area Chairs will provide a ranking of the papers they handle as Primary AC, identify borderline papers for rebuttal, and recommend early acceptance or rejection of papers based on consistent reviews and scores.

    I will therefore upgrade this to “accept” and hope this now comes through as a matter of principle.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new variational image registration model with an improved evidence lower bound and a transformer network for 2D cardiac registration. The proposed importance-weighted evidence lower bound closes the gap between the real and variational posterior by avoiding over-regularization on the posterior. The transformation model parameterized with multi-supports compact support radial basis functions (CSRBFs) imposes a sparse constraint on the transformation and is capable of regularizing the smoothness of the displacement vector field. Experiments on four public cardiac datasets show that the proposed method outperforms three learning-based variational registration methods in terms of Dice score, Hausdorff Distance (HD), and average perpendicular distance (APD), while maintaining the desired diffeomorphic properties of the solutions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Learning-based variational image registration is an essential yet less explored problem. Developing a variational image registration method for cardiac registration is well-motivated as it can provide registration uncertainties apart from the displacement vector field used to align images.

    The proposed method achieves superior registration performance over three learning-based variational image registration methods. Multiple metrics such as Dice score, HD, APD and bending energy are used to comprehensively quantify the registration performance of the proposed method and competitive methods.

    A complete ablation study on the iwELBO is provided, showing the effectiveness of each component/modification.

    The writing is clear, and the main components of the method, i.e., iwELBO, aggregated posterior and transformer network, are well-justified with explanation and formulation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The performance gain of the proposed iwELBO objective function over the other learning-based variational image registration framework, e.g., the probabilistic formulation in DalcaDiff, is unclear and ambiguous. Since the proposed method uses a different network architecture than the competitive methods, i.e., DalcaDiff, VoxelMorph and NetGI, it is unclear what the individual performance gain of the proposed network architecture and the iwELBO formulation is. The performance gain of the iwELBO, regardless of the network architecture, is crucial as it is the main contribution of the manuscript.

    The proposed network architecture is not particularly novel. The transformer architectures have been well-established in the literature on image registration [1, 2]. Hence, the ablation study of the proposed network architecture compared to the ViT is less insightful. Instead, it will be interesting to transfer the proposed learning paradigm to a regular CNN architecture, demonstrating the generalizability of the proposed method.

    Lack of analysis of the registration uncertainty. As mentioned in the manuscript, registration uncertainty is one of the main advantages of the probabilistic generative registration model. Yet, no concrete example or analysis demonstrates this advantage of the proposed method. How to compute the registration uncertainties in the proposed method? What are the potential applications of registration uncertainties in cardiac image processing?

    The proposed method is only exemplified on 2D registration. It would be great if there were an example of 3D registration, e.g., brain MR registration, using the proposed method.

    Minor:

    “Compact support radial basis function” should be defined in the “introduction” section.

    “VAE” is not defined throughout the paper.

    Typo – Page 2, “object function” -> “objective function”

    Typo – Page 8, “cascaded” -> “concatenated”

    Reference

    [1] Chen, Junyu, et al. “Transmorph: Transformer for unsupervised medical image registration.” MIA2022.

    [2] Shi, Jiacheng, et al. “Xmorpher: Full transformer for deformable medical image registration via cross attention.” MICCAI2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Fair reproducibility. Most of the technical details of the proposed method have been provided. Yet, the authors stated that the source code of the proposed method will be made publicly available following the acceptance of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    When comparing the proposed learning paradigm to existing learning-based variational image registration methods, applying the proposed iwELBO method to the same network architecture of the competitive method, e.g., Diff-VM, will be a good start to demonstrate the effectiveness of the proposed iwELBO formulation.

    The paper will benefit from adding visualization of the deformation field as well as an in-depth analysis of the registration uncertainty.

    In Table 1, showing the initial result of unregistered images, can give a sense of the difficulties/complexity of the task.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the proposed iwELBO used in variational image registration is novel, and the combination of the variational image registration method with the transformer is a well-grounded effort. Yet, the paper is held back by the incomplete evaluation of the iwELBO method and the lack of analysis of the registration uncertainty and generalizability of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The rebuttal does not fully address my concerns. A7 explains why the proposed method was only evaluated with 2D registration task, but the potential of the proposed method to be applied to 3D registration is not discussed. Moreover, one of the claimed advantages of the proposed method is to provide registration uncertainty, but the response in A5 claimed that it does not provide any additional information in this task, contradicting the statement in the paper, and the applicability of the registration uncertainty is doubtful.

    Overall, this paper is interesting and its merits slightly outweigh its weaknesses. As such, I will keep my score unchanged, i.e., “Weak accept”. As some details of the method are missing, I suggest releasing the training and evaluation code, as promised in the paper, following the acceptance of the paper.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel approach to learning-based variational image registration, which leverages the principles of importance-weighted autoencoders, implicit optimal priors, and cross-attention transformers. The method exhibits good performance when applied to 2D datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - The proposed method enhances the previously introduced NGI method [1] by employing a novel Transformer network architecture, replacing ELBO with iwELBO, and introducing an implicit optimal prior derived from the posterior of data.

    - The method demonstrated good performance on several publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The implementation description lacks certain specific details, such as the number of control points (n) for the CSRBFs, the sample size for the Monte Carlo estimator (K), the regularization parameter (lambda), and the quantity of Transformer blocks (N). The values for these parameters were not provided.

    - While the proposed method outperforms the baseline approaches, the impact of the introduced modules appears to be limited. Specifically, as shown in Table 2 and 3, the differences between iwELBO and ELBO, between using ViT and the proposed Transformer, as well as between using the aggregated prior and not using it, are relatively minor.

    - How was the training process for the discriminator (T(z)) carried out, and what are the components that make up the discriminator?

    - The proposed method works on 2D problems. Nevertheless, cardiac registration tasks, including motion estimation, typically involve 3D problems, indicating a potential limitation in addressing such applications.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the presented method is poor due to the omission of crucial information about hyperparameter choices.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    - Please share information on the selection of hyperparameters, including those previously mentioned and any others not yet mentioned, such as the number of training epochs and so on.

    - Please perform statistical tests on the quantitative experiments presented in Tables 1, 2, and 3 to validate the results.
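The reviews ask for significance testing on Tables 1 to 3 without naming a particular test. One simple option for per-case paired metrics is a sign-flip permutation test, sketched below with NumPy only; the Wilcoxon signed-rank test (`scipy.stats.wilcoxon`) would be a common alternative. The function name `paired_sign_flip_test` is hypothetical and the Dice values are synthetic placeholders, not results from the paper.

```python
import numpy as np

def paired_sign_flip_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean per-case difference.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so the differences are randomly sign-flipped n_perm times.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    permuted = np.abs((signs * diff).mean(axis=1))
    # The add-one correction keeps the estimated p-value strictly positive.
    return (np.sum(permuted >= observed) + 1) / (n_perm + 1)

# Synthetic per-case Dice scores for two methods on the same 40 test cases.
rng = np.random.default_rng(1)
dice_baseline = np.clip(rng.normal(0.80, 0.05, size=40), 0.0, 1.0)
dice_proposed = np.clip(dice_baseline + rng.normal(0.02, 0.02, size=40), 0.0, 1.0)

p_value = paired_sign_flip_test(dice_proposed, dice_baseline)
print(f"mean difference = {np.mean(dice_proposed - dice_baseline):.4f}, p = {p_value:.4f}")
```

A paired test is used here because both methods are evaluated on the same cases; an unpaired test would ignore that structure and typically lose power.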

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper could be improved by providing additional details, the introduced method is innovative and demonstrates good performance on the evaluated 2D datasets. If the authors carefully revise the paper, I would recommend its acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is borderline. Although it received three ‘weak accepts’, all reviewers shared important concerns and were all close to a borderline rating. I think the major issues need addressing in the rebuttal.

    Most reviewers mention the need to justify the motivation better and to relate the work better to existing literature. For example, R2 gives quite a few details of relationships to existing literature. Other reviewers note the lack of proper comparison.

    Several reviewers mention the lack of proper statistical significance and explaining how hyperparameters were chosen, and whether there is a separate held-out set (not validation set).

    My overall impression is that the reviewers are slightly leaning towards weak accept because the method seems interesting, but the clarity of the paper and the overall execution should be improved. I invite the authors to address this in a rigorous and careful rebuttal.




Author Feedback

Q1: Implementation description lacks details. (R3) A1: The datasets and control points are similar to those in NetGI [13]. 64 global control points are evenly spaced on the 128 × 128 image, while 100 local control points are evenly spaced in an area of 64 × 64 in the center of the image that contains the heart. The sample size for the Monte Carlo sampling is K=5; Lambda is 110000; the number of Transformer blocks is N=3.
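To make the control-point layout in A1 concrete, the following sketch builds a displacement field from two sets of compactly supported radial basis functions. It is an illustration under stated assumptions, not the authors' transformation model: the Wendland C2 function is used as the CSRBF, control points are placed at cell centres of an 8 × 8 grid (global) and a 10 × 10 grid (local), the support radii of 32 and 12 pixels are arbitrary, and the coefficients are random rather than predicted by a network.

```python
import numpy as np

def wendland_c2(r):
    """Wendland C2 CSRBF: (1 - r)^4 (4r + 1) for r < 1, and 0 otherwise."""
    r = np.clip(r, 0.0, 1.0)
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def grid_points(lo, hi, n):
    """n x n control points at cell centres of a regular grid on [lo, hi)^2."""
    step = (hi - lo) / n
    c = lo + step * (np.arange(n) + 0.5)
    yy, xx = np.meshgrid(c, c, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

def csrbf_dvf(coeffs, centres, support, size=128):
    """Displacement u(x) = sum_i coeffs[i] * psi(||x - centres[i]|| / support)."""
    ys, xs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    pix = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    dist = np.linalg.norm(pix[:, None, :] - centres[None, :, :], axis=2)
    return (wendland_c2(dist / support) @ coeffs).reshape(size, size, 2)

global_pts = grid_points(0, 128, 8)    # 64 global control points (A1)
local_pts = grid_points(32, 96, 10)    # 100 local points in the central 64 x 64 area (A1)

rng = np.random.default_rng(0)
global_coeffs = rng.normal(0.0, 1.0, size=(64, 2))   # placeholder coefficients
local_coeffs = rng.normal(0.0, 0.5, size=(100, 2))   # placeholder coefficients

local_dvf = csrbf_dvf(local_coeffs, local_pts, support=12.0)   # assumed radius
dvf = csrbf_dvf(global_coeffs, global_pts, support=32.0) + local_dvf
print(dvf.shape)
```

Because each basis function vanishes outside its support radius, the local coefficients only influence pixels near the image centre, which is how a sparse, region-specific refinement can be layered on a coarse global deformation.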

Q2: Details about the discriminator? (R3) A2: We optimize our network by iterating a two-step procedure. The encoder is updated using Eq.7 with the discriminator fixed. Next, the discriminator is updated using Eq.6 with the encoder fixed. These two steps are performed alternately. The discriminator is a network composed of four fully connected layers with a dropout layer. Because the discriminator is simple, we didn’t provide its details.

Q3: The approach itself is not well motivated. (R1) A3: The Transformer is an architecture over patches of the image that deals with this issue by computing the similarity between image patches. An image is split into fixed-size patches, each is then linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to the Transformer to extract image features. Ingeniously, each patch in the Transformer corresponds exactly to one control point used in our transformation model. This one-to-one correspondence is more conducive to network modeling. We stated this in the last paragraph on page 4 of the manuscript.
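The patch-to-control-point correspondence described in A3 can be sketched as follows. This is a minimal illustration, assuming that the 64 global control points on a 128 × 128 image form an 8 × 8 grid, so that each 16 × 16 patch has one control point at its centre. The function name `image_to_patches`, the embedding width of 32, and the random projection and position embeddings are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def image_to_patches(image, patch):
    """Split a 2D image into non-overlapping patches.

    Returns (tokens, centres): flattened patches in row-major grid order and
    the (row, col) pixel coordinates of each patch centre.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    gh, gw = h // patch, w // patch
    blocks = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    tokens = blocks.reshape(gh * gw, patch * patch)
    rows = (np.arange(gh) + 0.5) * patch
    cols = (np.arange(gw) + 0.5) * patch
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    centres = np.stack([rr.ravel(), cc.ravel()], axis=1)
    return tokens, centres

rng = np.random.default_rng(0)
image = rng.random((128, 128))
tokens, centres = image_to_patches(image, patch=16)   # 64 patches, one per control point

# Linear embedding plus position embeddings, with random placeholders for learned weights.
embed_dim = 32
projection = rng.normal(0.0, 0.02, size=(tokens.shape[1], embed_dim))
position = rng.normal(0.0, 0.02, size=(tokens.shape[0], embed_dim))
sequence = tokens @ projection + position
print(sequence.shape, centres[0])
```

Under this assumption each row of `sequence` is the token whose output would be decoded into the coefficients of the control point at the matching row of `centres`.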

Q4: Comparison methods are not representative of the relevant state of the art. (R1) A4: We compared against state-of-the-art unsupervised learning registration networks and VAE-based registration networks in Table 1. In particular, VoxelMorph [4] is the most famous unsupervised learning registration network; TransMorph is a state-of-the-art registration network structure based on a Swin Transformer and evaluated on a brain dataset; KrebsDiff [17], DalcaDiff [11], and NetGI [13] are VAE-based registration models. Overall, most representative works are compared with our work in Table 1 of the manuscript.

Q5: Whether there is a separate held-out set and the lack of proper statistical significance analysis. (Meta-Reviewer, R1, R2, R3) A5: The dataset is explained at the beginning of the experimental section in the manuscript. Each dataset contains training, validation, and testing subsets. We agree with the reviewers’ suggestions on statistical significance. We will add a boxplot of the comparative experiment to the revised paper.

Q5: Lack of analysis of the registration uncertainty. (R2) A5: Since the control points are similar to those in NetGI [13], the uncertainty estimation of DVFs is similar to the conclusions in NetGI. Considering that no additional information on uncertainty can be provided, we did not provide uncertainty analysis in the manuscript.

Q6: Effectiveness of iwELBO by applying it to other networks. (R2) A6: iwELBO can be applied to other VAE-based models, such as KrebsDiff [17] and DalcaDiff [11]. The performance comparison of iwELBO and ELBO is provided in the fourth and sixth rows in Table 2, which can validate the improvement of iwELBO. Besides, iwELBO is only one contribution of our work; the multi-supports transformation model and the aggregated posterior as prior are other contributions. Another point that needs explaining is that our transformation model is based on control points, while KrebsDiff [17], DalcaDiff [11], VoxelMorph [4], and TransMorph [9] are all dense DVF-based models. Therefore, Eq.7 cannot be directly used in the above models.

Q7: The proposed method is only exemplified on 2D registration. (R2, R3) A7: The resolutions of our dataset along the short axis are around 1mm, and those along the long axis are 5 to 13mm. Since the resolutions along the long axis are too coarse, it is challenging to register in 3D. Therefore, most existing cardiac registration of short-axis cine-MR images is performed in 2D.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper was originally borderline, but leaning towards accept.

    The rebuttal has helped clarify some concerns, and as R2 emphasizes, some of the promises should make their way to the camera ready.

    Overall, the contributions outweigh the concerns, and the paper should be discussed at MICCAI.

    One final note – one reviewer emphasized that a paper with three WA should be automatically accepted as per MICCAI rules. This year, the instructions clarified that this is up to the discretion of the meta-reviewer, and I thought a rebuttal was warranted given the borderline reviews. I believe the rebuttal has strengthened the paper and appreciate everyone’s efforts.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a variational registration framework using transformers for capturing long-range deformations in cardiac MRI. Furthermore, the authors introduce an importance-weighted ELBO. All reviewers recommended weak accept in the original reviews. Their concerns about related work, baseline comparisons and training details were addressed in the rebuttal. Therefore, I recommend acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers consistently recommended (weak) acceptance of this paper. The authors adequately addressed the major questions raised by the reviewers (which were also summarized by MR #1).


