Authors

Yongpei Zhu, Shi Lu

Abstract

Deformable medical image registration is widely used in medical image processing with the invertible and one-to-one mapping between images. While state-of-the-art image registration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on computer vision tasks. Existing models neglect to employ attention mechanisms to handle the long-range cross-image relevance in embedding learning, limiting such approaches to identify the semantically meaningful correspondence of anatomical structures. These methods also ignore the topology preservation and invertibility of the transformation although they achieve fast image registration. In this paper, we propose a novel, symmetric unsupervised learning network Swin-VoxelMorph based on the Swin Transformer which minimizes the dissimilarity between images and estimates both forward and inverse transformations simultaneously. Specifically, we propose 3D Swin-UNet, which applies hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with patch expanding layer is designed to perform the up-sampling operation to estimate the registration fields. Besides, our objective loss functions can guarantee substantial diffeomorphic properties of the predicted transformations. We verify our method on two datasets including ADNI and PPMI, and it achieves state-of-the-art registration accuracy while maintaining desirable diffeomorphic properties.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_8

SharedIt: https://rdcu.be/cVRSO

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

In this paper, the authors propose a framework for image registration using the Swin Transformer. The architecture is the Swin-UNet: a U-Net-like architecture with convolution blocks replaced by Swin Transformer blocks. The outputs of the network are the geometric transformations from each input image to the other. The (symmetric) loss minimizes image dissimilarity and the inverse-consistency error between the transformations, and penalizes irregular transforms and negative Jacobians.

The method is compared to state-of-the-art methods on brain MRI images and evaluated using Dice on cerebral structures and the percentage of negative Jacobians.
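The two evaluation metrics named in this summary can be sketched as follows. This is a minimal 2D illustration in NumPy, not the authors' evaluation code: the function names, the `(2, H, W)` displacement layout and the finite-difference Jacobian are all assumptions made for the example, and the paper itself works on 3D volumes.

```python
import numpy as np


def dice(seg_a, seg_b, label):
    """Dice overlap of one label between two integer label maps."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    if denom == 0:
        return float("nan")  # label absent from both maps
    return 2.0 * np.logical_and(a, b).sum() / denom


def nonpositive_jacobian_fraction(disp):
    """Fraction of pixels where det(J) <= 0 for phi(x) = x + disp(x).

    Assumed layout: disp has shape (2, H, W); disp[0] is the row
    displacement and disp[1] the column displacement, in pixel units.
    """
    # Finite-difference derivatives along rows (axis 0) and columns (axis 1).
    d_row_dr, d_row_dc = np.gradient(disp[0])
    d_col_dr, d_col_dc = np.gradient(disp[1])
    # Jacobian of phi is the identity plus the displacement gradient.
    det = (1.0 + d_row_dr) * (1.0 + d_col_dc) - d_row_dc * d_col_dr
    return float(np.mean(det <= 0))
```

A zero displacement field gives a fraction of 0, while a field that flips the row axis gives a fraction of 1. Reported percentages are this fraction multiplied by 100.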
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

– as far as I know, this is the first implementation of Swin-UNet for image registration

– the method compares favorably to the state of the art w.r.t. Dice, negative Jacobian percentage and computational time.

– the resulting registration method has all the properties expected of a good registration method: regularity, symmetry, invertibility.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

– only minor weakness (see below)
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

very good reproducibility, with code and dataset available.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- The Dice evaluation was done using FreeSurfer segmentations as ground truth. In an extended version, it would be interesting to also see results on a dataset with manual annotations (such as the IBSR or the MICCAI 2012 multi-atlas challenge dataset). Using these data only at test time would also help reinforce the results regarding generalization.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is clear, and the proposed pipeline is, as far as I know, the first implementation of Swin-UNet for image registration. The resulting registration method has all the expected good properties: regularity, invertibility, symmetry. Runtime and quality metrics compare favorably to the state of the art.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Review #2

Please describe the contribution of the paper

The authors present a deformable image registration network based on the Swin Transformer - Swin-VoxelMorph. They use the ADNI and PPMI datasets to evaluate the model. They achieve an average Dice Similarity Coefficient (DSC) of 0.775.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of the paper lies in its use of a transformer-based network for the deformable image registration task. The paper is well written and easy to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The main weakness of the paper is with the figures. Figure 3 is hard to follow. Are the warp field and Jacobian determinant corresponding to the top images or the bottom images? The colorbar is also very hard to see.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper seems reproducible, although the authors do not mention in the text whether they plan to make the code publicly available.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The authors should improve the figures and make Figure 3 easy to understand.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is very well written and easy to follow and the only weakness is the clarity and description of the figures.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

7
[Post rebuttal] Please justify your decision

The authors answered my queries and I will stick with my original rating.

Review #3

Please describe the contribution of the paper

This paper presents a method for deformable medical image registration using the Swin Transformer. The technical novelty of the method is to explicitly exploit the Swin Transformer for deformable medical image registration, and to use orientation and inverse consistency constraints to guarantee the topology preservation and inverse consistency of the predicted transformations.
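To make the inverse consistency constraint concrete, here is a minimal 1D sketch in NumPy. It is an illustration under stated assumptions, not the loss used in the paper: the function name, the 1D setting and the use of linear interpolation are all choices made for the example.

```python
import numpy as np


def inverse_consistency_residual(u_fwd, u_bwd):
    """Mean squared residual of composing two 1D displacement fields.

    u_fwd[i] and u_bwd[i] are displacements (in grid units) sampled at
    integer positions i. For an exact inverse pair,
    u_fwd(x) + u_bwd(x + u_fwd(x)) = 0 at every x.
    """
    x = np.arange(len(u_fwd), dtype=float)
    # Sample the backward field at the forward-warped positions
    # (linear interpolation; np.interp clamps outside the grid).
    u_bwd_at_warped = np.interp(x + u_fwd, x, u_bwd)
    residual = u_fwd + u_bwd_at_warped
    return float(np.mean(residual ** 2))
```

A constant shift paired with its negative gives a residual of zero. Pairing the same shift with a zero backward field gives the squared shift, which is the kind of mismatch such a constraint penalizes.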
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

It constructs objective functions that include orientation and inverse consistency constraints to guarantee the topology preservation and inverse consistency of the predicted transformations.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The structure is similar to reference [6]. The main difference is that the Swin Transformer replaces fully convolutional networks. The technical novelty of the method is limited and the performance improvement is incremental.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper is easy to follow, though some of the details are not clearly described.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

One of my main concerns with this work is a lack of precision in the explanation of steps which are essential for the whole apparatus to work well and be credible. For example, the authors state in the ‘symmetric loss terms’ 2.2 subsection: “where φMF and φFM are differentiable and invertible in a bidirectional fashion”. How is this guaranteed? This can be precisely formulated in mathematical terms.
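One way the requirement could be written down, using the paper's symbols φMF and φFM, is sketched below. The image domain Ω and the exact form of the conditions are assumptions made for illustration, not the authors' formulation.

```latex
\begin{aligned}
\phi_{MF} \circ \phi_{FM} &= \mathrm{Id}, \qquad \phi_{FM} \circ \phi_{MF} = \mathrm{Id}, \\
\det\!\big(J_{\phi_{MF}}(x)\big) &> 0, \qquad \det\!\big(J_{\phi_{FM}}(x)\big) > 0 \quad \text{for all } x \in \Omega .
\end{aligned}
```

A loss that only penalizes violations of these conditions encourages them on the training data, but does not by itself prove that they hold for every input pair.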
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The problems to be solved are challenging, and attempts to find solutions with the latest methods are worth encouraging. This paper also takes into account topological consistency and differentiability for registration.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper received reviews that lean towards accept. However, the reviewers still have several concerns, and unfortunately did not delve deeply into the paper details. In order to more deeply evaluate the paper, I also took a look.

It seems like the scope of the paper is to apply the Swin Transformer ideas to medical image registration. The novelty is somewhat limited, as there have been a substantial number of transformer-applied-to-registration papers in the last year. Essentially this is the application of an architecture to registration using a now-standard DL registration framework. It would be nice to have a discussion of how the paper differs, or some analysis of why transformers should be useful at all. I understand there is a general idea that transformers are more spatially flexible than CNNs, but the multi-scale (pyramid) designs used in essentially all CNN architectures for registration are quite powerful at capturing this.

It would also be really important for the authors to clarify how they chose the hyperparameters for the baselines, as of course this can be a very important determinant of results. Were the baselines tuned like the current method was?

A bit more minor, but I would encourage the authors to choose a name that is unique to their method.

Overall, I believe the paper is quite borderline, with the reviews leaning accept but not being very thorough, and overall I have several concerns. I encourage the authors to very carefully address these issues, and clearly articulate the evidence for the transformer architecture being useful in this setting.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

5

Author Feedback

Thank you for your valuable comments. We address your concerns below.

R#1: About the results on other datasets.
A1: Thank you for your suggestions. Considering the length of the paper, we may add the results on the IBSR dataset in the Suppl.

R#2: About Figure 3.
A1: Thank you for your questions. We will revise Figure 3 and our paper in the final version. Here we give two example MR slices from different experiments. The warp field and Jacobian determinant correspond to the top images. The colorbar shows the value range of the Jacobian determinant.

R#3: About the difference from [3] and novelty.
A1: Thank you for your questions. We would like to clarify again:
(1) The structure is different from [3]. In the introduction, we have shown that other methods like [3] conduct an inference from the CNN-based (like fully convolutional networks) low-level local embedding without considering the global relevance of the image pair. Thus, the resultant alignment may suffer implausible voxel-wise mapping, where the prior affine transformation and landmark annotation are required to circumvent the trap of local minima. To solve it, we are the first to propose Swin-VoxelMorph for medical image registration.
(2) Our objective functions are also different from [3]. We set Lsim to the mean squared voxelwise difference rather than NCC. We extend the loss by adding an inverse consistency constraint Lcons-pair. Swin-VoxelMorph achieves the best performance in average Dice, while producing fewer non-positive Jacobian voxels than others.
(3) About symmetric loss terms: Ref. [3]. Our objective functions can guarantee that the predicted transformations are differentiable and invertible in a bidirectional fashion.

Meta-R#1:
Q1: About some analysis of transformers.
A1: Thank you for your concerns. In the introduction, we have shown some analysis of the reasons. Registration is the process of establishing such correspondence by comparing different parts of the moving to the fixed image. Unlike CNNs, one point is that the self-attention mechanisms in a Transformer have an unlimited size effective receptive field. A CNN has a narrow field of view: it performs convolution locally, and its field of view grows in proportion to the CNN’s depth, the shallow layers have a relatively small receptive field, limiting the CNN’s ability to associate the distant parts between two images (Ref. [1]). The U-Net (or other multi-scale (pyramid) modules) was proposed to overcome this limitation by introducing down- and up-sampling operations. However, several problems remain: (1) the receptive fields of the first several layers are still restricted by the convolution kernel size, and the global information of an image can only be viewed at the deeper layers of the network; (2) as the convolutional layers deepen, the impact from far-away voxels decays quickly (Ref. [2]). However, Transformer is capable of handling such issues and focusing on the parts that need deformation. Specifically, another important point can be found in the response (1) to the R#3.
Q2: About the hyperparameters for the baselines.
A2: (1) We select the model that obtains the highest Dice on the validation set, with the weights of every loss term tuned by grid search. For the other baselines, we used their trained models, or trained them with the hyperparameters they used, to obtain the results. (2) We have done some ablation studies to choose the best hyperparameters for our method, such as the effect of up-sampling, the number of skip connections, the input size and the linear embedding dimension C. However, considering the length limit, we will give the details in the Suppl.

[1] Understanding the effective receptive field in deep convolutional neural networks. NIPS 2016.
[2] Medical image segmentation using squeeze-and-expansion transformers. arXiv preprint arXiv:2105.09511.
[3] Fast symmetric diffeomorphic image registration with convolutional neural networks. CVPR 2020.
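The receptive-field argument in the feedback above can be made concrete with the standard recurrence for the theoretical receptive field of a stack of convolutions. The sketch below is an illustration only: the function name and the two layer configurations are hypothetical and are not taken from the paper or its baselines. It computes the theoretical receptive field, which is an upper bound on the effective receptive field discussed in Ref. [1].

```python
def receptive_field(layers):
    """Theoretical receptive field (input voxels per axis) of a conv stack.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs, in input voxels
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical configurations, chosen only to show the trend.
plain = [(3, 1)] * 5               # five 3-wide convs, no downsampling
pyramid = [(3, 2)] * 4 + [(3, 1)]  # four stride-2 convs, then one more conv
print(receptive_field(plain), receptive_field(pyramid))  # prints: 11 63
```

With the same number of layers, the strided stack covers 63 voxels per axis against 11 for the plain one, which is the sense in which down-sampling widens the field of view. Whether that is enough for large deformations is the point on which the authors and the meta-reviewer disagree.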

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

This is a very borderline paper. I am overall unhappy with the answers from the authors, including the explanation about transformers:

In the introduction, we have shown some analysis of the reasons. Registration is the process of establishing such correspondence by comparing different parts of the moving to the fixed image. Unlike CNNs, one point is that the self-attention mechanisms in a Transformer have an unlimited size effective receptive field. A CNN has a narrow field of view: it performs convolution locally, and its field of view grows in proportion to the CNN’s depth, the shallow layers have a relatively small receptive field, limiting the CNN’s ability to associate the distant parts between two images (Ref. [1]). The U-Net (or other multi-scale (pyramid) modules) was proposed to overcome this limitation by introducing down- and up-sampling operations. However, several problems remain: (1) the receptive fields of the first several layers are still restricted by the convolution kernel size, and the global information of an image can only be viewed at the deeper layers of the network; (2) as the convolutional layers deepen, the impact from far-away voxels decays quickly (Ref. [2]). However, Transformer is capable of handling such issues and focusing on the parts that need deformation. Specifically, another important point can be found in the response (1) to the R#3.

The first statements are completely wrong about the effect size and so forth, specifically because the pyramid deals with this. I am not sure why the authors repeat it, but they then say that it is, after all, not an important point. The other points are similarly loose and hand wavy.

Unfortunately, this paper did not have thorough reviews, so it’s up to us to evaluate and look at the discussion carefully. I am unfortunately quite concerned with the replies in the rebuttal, which do not really address the concerns but rather repeat hand-wavy, unproven statements about the transformer. Similarly, some concerns from other reviewers are poorly handled, such as listing the loss (MSE vs NCC) as an important difference compared to a method.

I unfortunately lean towards rejection, and strongly suggest to the authors to carefully evaluate all this feedback. There is a very heavy literature on transformers and transformers for registration, and the goal is to carefully elucidate the contribution to science. Right now, I don’t quite see it.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

-

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

Unfortunately, I have to follow the direction of the first Meta-Reviewer for this paper and down-weight the reviews that are positive but also surprisingly short and not very informative. In particular, they only partially address the limited novelty, i.e. the relation to TransMorph (published on arXiv in November 2021), and the incomplete evaluation on automatic FreeSurfer labels. To be clear, the paper is really well written and there are no serious mistakes or inaccuracies in the description - I would definitely give high scores for the presentation. But the method itself has little contribution other than replacing certain CNN layers with Swin Transformers - this has been done in TransMorph, but that paper contains many more ablations and much more convincing experimental validation (using real segmentation, more competitive baselines from Learn2Reg, other modalities and anatomical sites etc.). The fact that this submission uses MSE instead of NCC is not necessarily a strength (when considering applications other than the brain). I also found the response in the rebuttal: “we may add the results on the IBSR dataset in the Suppl.” too ambiguous for such a serious concern. In summary, I think this is a solid paper that is just marginally below the threshold of acceptance in my batch.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

10

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

All reviewers originally suggested accepting the manuscript, and their scores have not changed since (7, 7, 5). However, Meta-reviewer #1 was less positive with regards to the novelty of transformer-applied-to-registration papers. This appears to be one of several MICCAI submissions involving SWIN transformers, and there are already a few arXiv papers on the subject.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

3

Meta-review #4

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

AC recommendations on this paper were split with a majority vote of “rejection”, while the reviewers expressed consensus in supporting acceptance, including after reading the rebuttal. The PCs thus assessed the paper reviews, meta-reviews, the rebuttal, and the submission. It is noted that the reviewers highly appreciated the novelty of the presented work as one of the first adoptions of the Swin Transformer in image registration, solving a challenging problem while taking into account the desired properties of the registration problem, and the SOTA performance demonstrated. While the ACs expressed additional concerns about the work, including incremental technical contribution, they mostly also considered the paper to be on the very borderline of acceptance and rejection. The PCs thus agreed with the convincing arguments of the reviewers and felt that the weaknesses pointed out were outweighed by the strengths listed. The final decision on the paper is thus accept.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

NR

Swin-VoxelMorph: A Symmetric Unsupervised Learning Model for Deformable Medical Image Registration Using Swin Transformer