
Authors

Jiacheng Shi, Yuting He, Youyong Kong, Jean-Louis Coatrieux, Huazhong Shu, Guanyu Yang, Shuo Li

Abstract

An effective backbone network is important to deep learning-based Deformable Medical Image Registration (DMIR), because it extracts and matches features between two images to discover the mutual correspondence needed for fine registration. However, existing deep networks focus on the single-image situation and are limited in the registration task, which is performed on paired images. Therefore, we advance a novel backbone network, XMorpher, for effective corresponding feature representation in DMIR. 1) It proposes a novel full transformer architecture including dual parallel feature extraction networks that exchange information through cross attention, thus discovering multi-level semantic correspondence while gradually extracting their respective features for final effective registration. 2) It advances the Cross Attention Transformer (CAT) blocks to establish an attention mechanism between images, which is able to find the correspondence automatically and prompts the features to fuse efficiently in the network. 3) It constrains the attention computation to base windows and searching windows of different sizes, and thus focuses on the local transformation of deformable registration while enhancing computing efficiency at the same time. Without any bells and whistles, our XMorpher gives VoxelMorph a 2.8% improvement on DSC, demonstrating its effective representation of the features from paired images in DMIR. We believe that our XMorpher has great application potential for more paired medical images. Our XMorpher is available at https://github.com/Solemoon/XMorpher
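The window-constrained cross-attention described in the abstract can be sketched as follows. This is an illustrative simplification, not the authors' implementation: tokens of a small base window from one image attend to tokens of a larger searching window from the other image, so each base token is updated with the best-matching features of its local neighborhood in the paired image. The shapes and the single-head, projection-free form are assumptions for brevity.

```python
import numpy as np

def window_cross_attention(base, search):
    """Single-head cross attention (illustrative sketch).
    base:   (n_b, d) tokens from a base window of one image
    search: (n_s, d) tokens from the paired image's searching window (n_s >= n_b)
    Returns (n_b, d): base tokens updated with matched searching-window features."""
    d = base.shape[-1]
    scores = base @ search.T / np.sqrt(d)         # (n_b, n_s) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over searching tokens
    return attn @ search                          # weighted sum of matched features

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 16))     # e.g. a 2x2x2 base window, embedding dim 16
search = rng.standard_normal((27, 16))  # e.g. a 3x3x3 searching window of the other image
out = window_cross_attention(base, search)
print(out.shape)  # (8, 16): one updated token per base-window voxel
```

Restricting attention to a searching window slightly larger than the base window reflects the local nature of deformable registration: each voxel is only expected to move within a bounded neighborhood.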

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_21

SharedIt: https://rdcu.be/cVRS1

Link to the code repository

https://github.com/Solemoon/XMorpher

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a full transformer architecture that extends the cross-attention transformer to establish an attention mechanism between images for multi-level semantic correspondence.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work extends the Cross Attention Transformer (CAT) for communication between a pair of features from the moving and fixed images, promoting feature matching for image registration.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The discussion of existing deep registration and transformer-based registration models is inappropriate. The architecture description of the proposed full transformer-based registration model, and of its integration with existing registration models, is lacking.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper has provided details about the models, datasets, and evaluation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It is interesting to classify the existing registration methods as fusion-first, fusion-last, and cross-attention-based fusion. The authors claim that the first two categories of methods fail to find the one-to-one correspondence between images. However, existing deep registration models, such as the diffeomorphic variant of VM, are able to find invertible registration fields.
    2. The transformer has been applied to image registration and correspondence tasks in the last two years, e.g., [4, 20]. The image patches from both the fixed and moving images are fed to the transformer, so the existing methods [4, 20] do not just compute relevance within one image, as claimed by the authors.
    3. It is unclear how the proposed XMorpher computes the DVF \phi. As shown in Fig. 2, Concat+Conv operations are required to compute the DVF. Does this mean a CNN-based decoder is used to compute the registration field?
    4. In Fig. 3, all compared methods achieve reasonable deformation fields with organ contours consistent with the fixed image. We noticed that the proposed approach achieves smooth organ boundaries. It would be helpful to discuss which scheme in the proposed approach contributes to the smooth boundaries.
    5. It is unclear how to apply XMorpher to an existing registration network such as VM or PC-Reg. VM utilizes a U-Net-based framework with a convolutional encoder for feature extraction and a decoder for the DVF. It would be helpful to discuss whether VM-XMorpher uses CNN-based feature extraction and field inference.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed full transformer architecture utilized the dual parallel feature extraction networks, which exchanged information through cross attention, discovering multi-level semantic correspondence for effective registration. The proposed XMorpher has shown performance gains over existing deep registration models in DMIR.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors introduce a parallel transformer backbone, XMorpher, for the image registration task. Unlike current CNN-based registration networks, which fuse the moving-fixed image features either first or last, the proposed method uses a cross-attention block to fuse the moving-fixed features progressively at multiple levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea is fine, and the experimental results show the proposed method is efficient.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some of the comments the authors make about existing methods are subjective. For example, I do not agree with their comments on CNN-based registration methods, including fusion-first and fusion-last, because the registration network learns the spatial deformation between paired images; thus, moving-fixed feature extraction and matching should not be split, while the authors claim these two steps should be split. Please provide more experiments or references to support this point.
    • The proposed XMorpher seems to require a huge training dataset; thus, its generalization seems questionable. I think large deformations might be limited by the window size if the affine transformation is removed.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claimed they will release their code publicly.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1) I like the way the authors present Figure 1, but they should provide more evidence / references to support their claims.
    2) In Fig. 2, do you predict two deformation fields, or just one? If you predict only one deformation field, you must fuse the moving-fixed features in the up-decoder block; thus, your method should not be named X-shape, it is actually Y-shape. If you predict two deformation fields, you cannot denote "moving and fixed image" as your inputs. Please give more details here.
    3) The proposed XMorpher seems to require a huge training dataset; thus, its generalization seems questionable. Can you give more quantitative numbers, such as the minimum required training dataset size, the number of network parameters, and the computing efficiency (FLOPS)? Besides, I think large deformations might be limited by the window size if the affine transformation is removed.
    4) In Table 1, why does the model without the cross block achieve the best performance on the Jacobian metrics? Please discuss. Besides, your network is parallel and uses the cross block to fuse moving-fixed features. If you remove the cross block, how does the model perform the registration task? As in comment 2), do you fuse moving-fixed features in the up-decoding block? If so, your network is Y-shaped.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea and results are satisfactory.

  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The paper proposes a transformer architecture called XMorpher for DMIR. XMorpher includes a dual parallel network to extract image features of the fixed and moving images. Then, at each level, the Cross Attention Transformer (CAT) blocks compute the mutual relevance between the extracted features and match the corresponding regions to obtain a fine DVF. Results show some improvement compared to other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The concept of dividing XMorpher into a feature extraction part and a correspondence matching part within the Transformer framework is valuable. The proposed CAT block advances feature communication between the fixed and moving images in a multi-level semantic scheme. The results show the improvement in efficiency and accuracy of XMorpher.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Figures and their captions need to be improved. More detail is given in Q8.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1, The figures (especially Fig. 2) in the paper need to be clearer; I list some suggestions below:

    • The text in Fig. 2 is too small to read; I had to zoom in to 300%. You should reorganize the layout of your figure.
    • In both Fig. 1 and Fig. 2, the moving and fixed images seem to be of two different modalities, which is a little confusing.
    • In Section 2.2, you mention input 'b' and input 's'; however, they are missing from Fig. 2. Besides, the notation used throughout the text should be clearly defined and consistent; for example, there is a typo in Section 2.3, 'and thus S_ba has size of n×α·h×β·w×γ·d', where I think 'S_ba' might be 'S_se'.
    • The four figures in Fig. 4(b) lack explanations. What are these pictures? The fixed image? The warped image? And what do the arrows and the overlap map represent?

    2, In your experiments, you apply XMorpher as the backbone in two CNN-based frameworks, VoxelMorph and PC-Reg. I do not really understand this, because you say XMorpher is a full transformer structure, and I hope to get more detailed information about the implementation. Also, you do not describe how you obtain the final DVF in your article, which is important.
    3, You claim that your method is more efficient; please report your inference time.
    4, Many learning-based methods fuse features at multiple levels and predict the DVF in a multi-scale manner. The work would be much more solid with comparisons against these methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has certain innovations, and the experiments show the improvement of the proposed method compared to baseline models. Although some weaknesses exist, it can be accepted after correction.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Full Transformer for Deformable Image Registration via Cross Attention

    This submission proposes to exploit multi-level cross-attention maps to register two images. To do so, transformer blocks are used to alternately perform mutual feature extraction and feature matching between pairs of images. The originality is that in existing work the attention mechanism operates on single images, whereas the proposed approach finds cross-attention between pairs of images. The evaluation is on cardiac images in unsupervised and semi-supervised settings (using two different backbone registration frameworks). While the originality of using a cross-attention mechanism can be appreciated, the reviewers raise issues that require clarification in a rebuttal:

    • methodology - the reviewers find confusion and missing key information in the general methodological description (R1,2,3)

    • motivation - the conceptual highlight on the fusion-first and fusion-last issues in current registration approaches is challenged by the reviewers. R2, for instance, disagrees that feature extraction and matching should be split - motivation and clarification of the evaluations assessing the need to split extraction and matching could strengthen the submission.

    • novelty - along the same lines, transformers and attention maps are already used in registration algorithms, as raised by the reviewers - highlighting the true novelty of the cross-attention could improve the originality of the proposed dual-path blocks.

    • validation - the computational cost is currently not addressed and is questioned by the reviewers.

    • several typos and missing definitions (e.g., DVF is never defined) are present in the submitted manuscript.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank all reviewers for their positive appreciation: 1 Great contribution (R1-“promoting the features matching” “effective registration”, R3-“advances the feature communication”) 2 Novelty (R1-“idea is fine” “interesting” R2-“idea and results are satisfactory” R3-“certain innovations”, meta-“originality can be appreciated”) 3 Accurate results (R1-“gain performance over existing” R2-“efficient” R3-“improvement of efficiency and accuracy”) 4 Nice visualization (R1-“smooth organ boundaries”)

  • Motivation - Fusion-first and fusion-last (R1): We only claimed that fusion-first networks lack the “one-to-one correspondence” ability; fusion-last networks do have this ability. Fusion-first networks lack it because they fuse the fixed and moving images into a mixed object before putting them into the network. Therefore, the information in the input object is mixed and loses the ability to learn the “one”-to-“one” correspondence. The cited example, diffeomorphic VMs, are fusion-first networks that still mix images before input, losing the “one”-to-“one” ability.

  • We claimed the attention mechanism in the existing transformer-based DMIRs [4, 20] is designed for the single-image situation (R1). This is because their transformer backbones are still designed for single-input situations, like other CNN-based fusion-first DMIRs, which have to mix the information of the two input images first. So, the attention in their backbones can only be calculated between patches within a mixed image space (not between images), limiting the ability to learn one-to-one relevance.

  • Our comments on “fusion-first and fusion-last” are in line with the mainstream view (R2): The latest widely recognized registration review paper (10.1007/s00138-020-01060-x) divides DMIR into “Similarity Metric based” and “Feature based” methods, which correspond to our definitions. We renamed them in our way (“fusion-first” and “fusion-last”) to make our introduction clearer and more focused on the backbone (where we improve) in registration.

  • The clarification of feature extraction and matching (R2): We claim that both splitting (fusion-last) and mixing (fusion-first) these two steps limit registration, not that “moving-fixed feature extraction and matching should be split”. Both fail to coordinate the extraction and matching of multi-level features, due to information that is either completely independent (challenging for feature matching) or mixed (challenging for feature extraction).

  • Validation
  • Clarification of “Efficiency” (R2, R3): 1 “Efficiency” in contributions 1 and 2 means our XMorpher has higher representation efficiency. Our cross-attention-based fusion coordinates the extraction and matching of features, producing better feature representations. Compared with other transformer-based DMIRs, our method (20M) takes only 1/5 of the parameters of TransMorph (108M) and achieves 1.9% higher DSC. The fewer parameters and higher accuracy show our representation “efficiency”. 2 “Efficiency” in contribution 3 means the window-based method has higher computational efficiency. Attention within a window is more efficient than global attention, as has been demonstrated in well-known works such as Swin Transformer.
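The computational argument for window-based attention made above can be made concrete with a rough operation count. The token counts, dimension, and window size below are hypothetical, chosen only to illustrate the scaling, not figures from the paper:

```python
# Global attention over N tokens costs O(N^2 * d); attention restricted to
# windows of w tokens costs O((N / w) * w^2 * d) = O(N * w * d).
def global_attention_cost(n_tokens, dim):
    return n_tokens ** 2 * dim

def window_attention_cost(n_tokens, window, dim):
    return (n_tokens // window) * window ** 2 * dim

N, d, w = 4096, 96, 64  # hypothetical token count, embedding dim, window size
speedup = global_attention_cost(N, d) // window_attention_cost(N, w, d)
print(speedup)  # 64: the N / w ratio, i.e. linear rather than quadratic in N
```

The ratio grows with the volume size, which is why windowed attention matters for 3D medical images, where N is large.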

  • Details in Method
  • In our method (R1, R2, R3): 1 No, the “Concat+Conv” is just the head layer that projects the features to the DVF. Our XMorpher is a full transformer backbone whose encoder and decoder are all transformers. (R1, R3) 2 Our XMorpher replaces the original backbones in VM and PC-Reg to implement our method in the experiments. (R1, R3)

  • In the no-cross XMorpher (R2): 1 Implementation: It takes the fusion-first strategy, which turns our cross-attention into self-attention and mixes the two images before input, like VM. 2 Why smoother: Its mixed input makes details unclear, and its patch-based representation exacerbates the challenge of corresponding details. So, it reduces the warping of details, giving smoother deformation but worse accuracy.
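The “Concat+Conv” head described in the rebuttal (a projection from the paired features to the DVF) might be sketched as below. This is a hypothetical minimal version assuming a 1x1x1 convolution to 3 output channels, written as a per-voxel linear map; the actual head in the paper may differ in kernel size and depth.

```python
import numpy as np

def dvf_head(feat_m, feat_f, weight, bias):
    """Concat two (C, D, H, W) feature maps channel-wise, then project each
    voxel to a 3-vector displacement (a 1x1x1 conv as a linear map).
    weight: (3, 2C), bias: (3,). Returns the DVF of shape (3, D, H, W)."""
    fused = np.concatenate([feat_m, feat_f], axis=0)  # (2C, D, H, W)
    c2, dz, hy, wx = fused.shape
    flat = fused.reshape(c2, -1)                      # (2C, D*H*W)
    dvf = weight @ flat + bias[:, None]               # (3, D*H*W)
    return dvf.reshape(3, dz, hy, wx)

rng = np.random.default_rng(1)
C, D, H, W = 8, 4, 4, 4                 # toy feature-map sizes
fm = rng.standard_normal((C, D, H, W))  # moving-path features
ff = rng.standard_normal((C, D, H, W))  # fixed-path features
Wt = rng.standard_normal((3, 2 * C)) * 0.01
phi = dvf_head(fm, ff, Wt, np.zeros(3))
print(phi.shape)  # (3, 4, 4, 4): one displacement vector per voxel
```

This makes the rebuttal's point concrete: the transformer backbone does all the feature extraction and matching, while the head is only a lightweight projection to the registration field.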




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Full Transformer for Deformable Image Registration via Cross Attention

    The rebuttal has clarified the necessary methodological description and motivation. The manuscript can be updated accordingly without jeopardizing its overall status. The recommendation is therefore toward acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed method presents a sufficient novelty by introducing an inter-subject cross-attention mechanism in the deep learning-based deformable registration problem. The rebuttal addressed reviewers’ critiques, including clarification of the fusion mechanism (which was a critical question from the reviewers).

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    A perhaps technically sound work but with limited multidisciplinary novelty, questionable improvement and a lack of clinical relevance, as correctly pointed out by the reviewers.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


