
Authors

Tai Ma, Xinru Dai, Suwei Zhang, Ying Wen

Abstract

Large deformation image registration is a challenging task in medical image registration. Iterative registration and pyramid registration are two common CNN-based methods for the task. However, these methods usually consume more parameters and time. Additionally, the existing CNN-based registration methods mainly focus on local feature extraction, limiting their ability to capture the long-distance correlation between image pairs. In this paper, we propose a fast and accurate learning-based algorithm, Pyramid-Iterative Vision Transformer (PIViT), for 3D large deformation medical image registration. Our method constructs a novel pyramid iterative composite structure to solve the large deformation problem by using low-scale iterative registration with a Swin Transformer-based long-distance correlation decoder. Furthermore, we exploit the pyramid structure to supplement the detailed information of the deformation field by using high-scale feature maps. Comprehensive experimental results on brain MRI and liver CT datasets show that the proposed method is superior to the existing registration methods in terms of registration accuracy, training time and parameters, with an especially significant advantage in running time. Our code is available at https://github.com/Torbjorn1997/PIViT.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_57

SharedIt: https://rdcu.be/dnwxa

Link to the code repository

https://github.com/Torbjorn1997/PIViT

Link to the dataset(s)

https://drive.google.com/file/d/1rJtP9M1N3lSjNzJ5kIzRrrwPe1bWCfXB/view

https://drive.google.com/file/d/17IiuM74HPj1fsWwkAfq-5Rc6r5vpxUJF/view

https://drive.google.com/file/d/19v5-qRF3KwA8Snf5ei-qtMv-nDYyXBzv/view

https://drive.google.com/file/d/1xQMmYk9S8En2k_uavytuHeeSmN253jKo/view


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a deep learning architecture for pairwise nonlinear image registration that better deals with situations where no initial rigid-body or affine alignment has been done.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The overall architecture seems moderately sensible for including an iterative framework for dealing with mispositioning. The encoding steps involved some nice weight sharing.

    The accuracy and speed of the method compare well against those of the baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Not clear if the proposed network offers benefits when images are affine or rigidly aligned beforehand.

    Some parts of the architecture were not fully explained or motivated.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some details of the network are missing, such as the nonlinearities used following convolutions, as well as numbers of channels and how these were chosen. Details of dimensions involved in the SWIN transformers were missing. Manuscript did not contain a placeholder for a repo.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The motivation for aligning organs of one individual with those of another was not explained. Normally this type of thing is done to enable multi-atlas label fusion techniques to be applied, but deep learning methods seem to have superseded these methods. Alternatively, perhaps clinicians want to use registration to achieve simple comparisons among populations of subjects, but this would be a subtly different registration task.

    Clarify whether there was weight sharing across the LCD iterations.

    Better explain why LCD was used for the coarse alignment, but convolutions (with some sort of nonlinearity) were used to update the finer alignments. Why not the same throughout?

    It would have been enlightening to see how the behavior of the methods changed if all images had been affine aligned beforehand, and whether the iterative approach would offer any benefits in this situation. It would also have been useful to give some form of measure of the positioning variability.

    Large deformations generally refer to large nonlinearities (as in large deformation diffeomorphic metric mapping), rather than the ability to handle variability in overall positioning.

    Rather than adding updates to displacement fields, perhaps consider composing them together. Alternatively, feed the current \phi as well as F_m(\phi) and F_f into the parts of the network that estimate updates.

    As deep learning is being used to optimize some objective function, it would be useful to compare the resulting values for that objective function with the different architectures to assess their effectiveness.

    The text says “requires huge GPU memory”. Clarify whether this is for training or deployment.

    Figures 2 and 4 show the brain of a subject lying face down in the scanner. Show them face up instead.

    Check the capitalization used in the references.
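
The suggestion above about composing displacement fields rather than adding them can be illustrated with a minimal 1D sketch. This is not the authors' implementation: the function names, the 1D grid, and the convention that the update is applied first (phi_new = phi o psi) are all assumptions made for illustration.

```python
import math

def sample_linear(values, x):
    """Linearly interpolate a 1D list at continuous position x (clamped to the grid)."""
    x = min(max(x, 0.0), len(values) - 1.0)
    i0 = int(math.floor(x))
    i1 = min(i0 + 1, len(values) - 1)
    t = x - i0
    return (1.0 - t) * values[i0] + t * values[i1]

def add_update(u, du):
    """Additive update: u_new(x) = u(x) + du(x)."""
    return [a + b for a, b in zip(u, du)]

def compose_update(u, du):
    """Compositional update for phi = id + u and psi = id + du (assumed convention):
    (phi o psi)(x) = x + du(x) + u(x + du(x)), so u_new(x) = du(x) + u(x + du(x))."""
    return [du[x] + sample_linear(u, x + du[x]) for x in range(len(u))]

# Hypothetical example: current displacement u(x) = 0.5 * x, constant update of one voxel.
u = [0.0, 0.5, 1.0, 1.5, 2.0]
du = [1.0, 1.0, 1.0, 1.0, 1.0]
added = add_update(u, du)
composed = compose_update(u, du)
```

The two results differ wherever the current displacement varies spatially, because composition reads the current field at the updated position rather than at the original grid point.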

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The accuracy and speed seem relatively convincing, and many aspects of the architecture seem sensible. Some parts could probably be improved slightly though.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My opinion post-rebuttal has not changed.



Review #2

  • Please describe the contribution of the paper
    1. The authors propose a pyramid-iterative registration framework. This framework extracts feature map pairs via a dual-stream weight-sharing encoder, performs iterative registration on the low-scale feature space, and finally complements detail information and learns deformation fields during pyramid decoding.

    2. The proposal of a Swin Transformer-based long-range correlation decoder

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes a novel framework for pyramid image registration to tackle the problem of large deformations.
    • The paper furthermore explores the usage of Transformer models for image registration showing promising results.
    • The paper is well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The clinical relevance for an image registration method for intra-patient brain MR and even more liver CT is not clear. Please explain why such registration is helpful. For population studies, you could also just segment the structures and then compare volume etc. Why do we need voxel-wise correspondences?
    • The authors claim that other registration methods are “relatively not suitable for large deformation image registration”. There are already enough papers that have shown that CNN-based methods are also capable of handling large motions. However, the more relevant point is whether they are also capable of registering the image locally accurately enough. Please don’t try to justify your own paper by making the others look bad. That is not necessary.
    • The authors don’t discuss failure cases or limitations of the proposed method.
    • The results are “obviously” better than the other methods, however, no statistical test was performed
    • There are a lot of open questions I had after reading the paper.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors don’t say that they will make the code available or already have done it. Furthermore, no statistical analysis was performed. It’s not clear to me if the parameters of the other methods were changed or kept like in the original paper.

    1. For all code related to this work that you have made available or will release if this work is accepted, check if you include: Specification of dependencies. [Yes] Training code. [Yes] Evaluation code. [Yes] (Pre-)trained model(s). [Yes] Dataset or link to the dataset needed to run the code. [Yes] README file including a table of results accompanied by precise command to run to produce those results. [Yes]
    2. For all reported experimental results, check if you include: The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results. [Yes] Information on sensitivity regarding parameter changes. [Yes] The exact number of training and evaluation runs. [Yes] Details on how baseline methods were implemented and tuned. [Yes] The details of train / validation / test splits. [Yes] A clear definition of the specific evaluation metrics and/or statistics used to report results. [Yes] Discussion of clinical significance. [Yes] A description of the computing infrastructure used (hardware and software). [Yes] An analysis of situations in which the method failed. [Yes] A description of the memory footprint. [No] The average runtime for each result, or estimated energy cost. [Yes] An analysis of statistical significance of reported differences in performance between methods. [Yes]
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors say that “cascading several CNNs, which requires huge GPU memory”. How much does this need in comparison to the proposed method? There are multi-level approaches that can be trained on a 12GB GPU and need only 4GB for inference. Is this a huge GPU?
    Open Questions:
    • Why are the number of iterations chosen to be 150 000? Is that only optimized for the proposed method or is this also optimal for all other methods?
    • Which segmentation is used for the liver CT registration? The liver? How is that obtained? The liver is quite a large organ. A high Dice for liver overlap doesn’t say anything about the registration accuracy as the
    • Why is the number of foldings only evaluated on the LPBA dataset and not for the liver CT registration?
    • Why is the Voxelmorph registration similarly fast on the GPU but waaaay slower on the CPU compared to the proposed method?
    • Can you please name a time-sensitive task where the difference of 0.04s is important (for intra-patient registration)?
    • What kind of labels are shown in Figure 2 of the supplementary material? There are quite a few different colours; why? The Voxelmorph results look like the hyperparameters aren’t well defined. And also for the proposed method, the deformation field seems to have a lot of foldings!
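
The "number of foldings" raised in the questions above is commonly counted as the number of grid points where the Jacobian determinant of the deformation is non-positive. The sketch below assumes that definition, a 2D field stored as nested lists, and forward differences; the function name and the example field are hypothetical, and the paper's exact counting procedure is not stated on this page.

```python
def count_foldings_2d(ux, uy):
    """Count cells where det(J_phi) <= 0 for phi(x, y) = (x + ux, y + uy).

    ux[i][j] and uy[i][j] are displacements along x (column index j) and
    y (row index i). Forward differences are used, so the last row and
    last column are not evaluated.
    """
    rows, cols = len(ux), len(ux[0])
    folds = 0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dux_dx = ux[i][j + 1] - ux[i][j]
            dux_dy = ux[i + 1][j] - ux[i][j]
            duy_dx = uy[i][j + 1] - uy[i][j]
            duy_dy = uy[i + 1][j] - uy[i][j]
            # Jacobian of phi is the identity plus the displacement gradient.
            det = (1 + dux_dx) * (1 + duy_dy) - dux_dy * duy_dx
            if det <= 0:
                folds += 1
    return folds

# Hypothetical fields on a 3x3 grid: identity mapping, and a reflection in x.
zeros = [[0.0] * 3 for _ in range(3)]
reflect_x = [[-2.0 * j for j in range(3)] for _ in range(3)]
identity_folds = count_foldings_2d(zeros, zeros)
reflection_folds = count_foldings_2d(reflect_x, zeros)
```

For a 3D field the same idea applies with a 3x3 Jacobian per voxel; absolute counts depend on the grid size, so they are only comparable between methods evaluated on the same volumes.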
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, this is an interesting registration paper (even if I would like to see a different application than Brain MR!!). There are still some questions left open and some revision is needed. This can be done during the rebuttal.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors will slightly improve their manuscript (if they include the answers to the questions into their paper, e.g. number of foldings, statistical test, limitations and also talk about failure cases!!) and therefore, I will change my review to “weak accept”



Review #3

  • Please describe the contribution of the paper

    This paper presents a registration network that is robust to large displacements. It builds on top of a pyramid architecture that has become popular these last few years, where a downward feature-extraction arm is applied to both input images (with shared weights), and the displacement field is progressively decoded in the upward arm. This architecture is modified by including an iterative refinement of the displacement field at the deepest layer. This iterative refinement uses a swin-transformer block, while all other levels use simple convolutional blocks. The network is trained on ABIDE+ADHD+ADNI and tested on LPBA, where it performs positively compared to a range of recent baselines. The type of refinement block and number of refinement iterations are evaluated in an ablation study.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method takes advantage of recent advances in registration networks: siamese feature extractors, progressive pyramidal prediction, and transformers. It only uses the transformer block at the coarsest level, making the architecture much more lightweight than other transformer-based networks. The network is properly evaluated (on an entirely different test dataset – for which I commend the authors) against SOTA baselines. I’ll add that the paper is well written and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My main concern is related to the use of LPBA for validation, which only includes labels for the gross cortical lobes, and does not differentiate gray and white matter. It is actually quite clear in the supplementary figures that many cortical folds do not align. It is also unclear if competing methods were retrained on the same data, or if they have been used off-the-shelf with pretrained weights.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Many quantitative details about the architecture are not provided (patch and window size in the LCD, number of features, etc.). While it is stated that the code will be made available, I believe it is beneficial to also provide technical details in the paper (eventually in supplementary material).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The LCD module does not seem to do much better than the GRU module in the ablation study, both in terms of accuracy, TPI, and number of parameters.
    • I would really appreciate an evaluation on a dataset that includes fine cortical labels, such as mindboggle.
    • In Supp Fig 2, it looks like you linearly interpolated a label map, hence the small colored voxels at the boundary of the liver. It is a bit distracting.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and easy to follow. It does not only focus on Dice, but also takes model weight and runtime in consideration. It reaches SOTA results on the test datasets. However, in my opinion, it uses suboptimal test datasets, and is quite incremental, which is why I only give a “weak accept”.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My score before rebuttal was 5, although borderline 4/5. The rebuttal did not lift any of my concerns. On the contrary, the relatively low Dice scores on Mindboggle confirm my fears (i.e., fold alignment is very poor). I will therefore stick to my previous score of 5.

    There seems to be a consensus among reviewers that this paper (and other quite similar ones that I’ve reviewed) is frustrating because while the science is relatively sound, the methods are extremely incremental, are restricted to toy problems without real world application, and only yield small improvements above SOTA.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Overall, the reviewers see this paper as borderline. I am inviting the paper to rebuttal, and would like to emphasize that it would be useful for the authors to address the serious concerns from the reviewers (including those recommending a weak accept).

    First, please address the concerns from R2 – it does seem that the paper is lacking in motivation, and is needlessly and inappropriately ignoring the substantial effort and progress the field has made. A statistical assessment (on a completely held-out set that was not seen during training or validation) is also lacking and is required to make sense of the results, as R2 mentions.

    All reviewers have concerns about the experimental limitations as well; these should be clearly justified and addressed in the rebuttal.




Author Feedback

We sincerely thank the reviewers for their useful suggestions. PIViT will be released on GitHub soon. We’ll fix the linguistic and typographical mistakes in the revision.

R#1&2&3

Following the reviewers’ comments, we perform registration experiments and statistical assessment (p-value) using ABIDE+ADHD+ADNI for training, LPBA for validation, and Mindboggle for testing. In addition, to verify the performance of PIViT in locally accurate alignment, we calculate the Dice score on Mindboggle with and w/o affine alignment.

Method | w/o Affine | Affine
PIViT | 54.7 (-) | 56.7 (-)
VM | 43.2 (p<0.01) | 54.5 (p<0.01)
VMdiff | 45.7 (p<0.01) | 54.1 (p<0.01)
RCN | 56.6 (p=0.13) | 59.8 (p<0.01)
LKU-Net | 45.8 (p<0.01) | 56.0 (p=0.15)
NICE-Net | 52.2 (p=0.12) | 58.4 (p<0.01)

Mindboggle is a brain dataset with fine cortical labels. Experiment results show that the lightweight PIViT outperforms most methods without affine alignment, but does not differ significantly from RCN and NICE-Net (p>0.05). On the affinely aligned dataset, PIViT may not exhibit an obvious advantage in local fine registration, but it still achieves satisfactory alignment. Thus, PIViT is more suitable for aligning images with large deformations.

R#1&2

Q1: Explain the motivation for aligning inter-patient scans.
A1: Considering that deep learning methods require large amounts of public datasets mostly containing inter-subject images, we use them to test the model’s generalization and to avoid overfitting. Though inter-subject registration is less common compared to intra-subject registration, it’s still valuable for aligning a patient’s scan with a healthy or standard scan for comparison. And both inter-subject data (LPBA, SLIVER, LiTs) and intra-subject data (LSPIG) are employed in our manuscript.

Q2: Compare the GPU memory required.
A2: The GPU memory required for training PIViT, RCN and NICE-Net is 3131MB, 9297MB, and 9271MB respectively.

R#1

Q3: Is weight sharing used across LCD iterations?
A3: The experiment results show that LCD without weight sharing performs better.

Q4: Why is LCD used for coarse alignment, but CNNs for finer alignment?
A4: LCD is suitable for capturing long-range dependencies with large resources on high scales. CNN excels in capturing local correlations. Thus, this operation balances computational cost and accuracy.

R#2

Q5: Limitations of PIViT.
A5: Since PIViT is a lightweight model, it is slightly inferior to other deeper models in capturing local fine differences and preserving diffeomorphic properties.

Q6: Problems with the number of iterations and the parameter.
A6: The number of iterations follows VM’s previous work, and all methods converge sufficiently. The parameters of other methods remain consistent with their papers.

Q7: Concerns about GPU and CPU time.
A7: To explore the above issue (Table 1), we re-test the GPU and CPU time of PIViT with a 3-iteration CNN decoder, which is 0.06s and 0.24s, respectively. The difference in GPU time is minimal, while the difference in CPU time is significant. These results indicate that both PIViT and VM approach the maximum GPU inference speed on the current device and data, while the CPU time difference reflects inference speed disparity.

Q8: Regarding the issues of liver CT datasets and foldings.
A8: We use open-source liver datasets provided by RCN, which include liver anatomy labels and some edge point labels (represented by different colors in Supplementary Material Fig.2). We only use liver labels to measure the Dice score and use liver data to validate PIViT’s robustness to large deformations. The hyperparameters of all the methods are consistent with their papers. The number of foldings on SLIVER is 50302 for VM and 11183 for PIViT, which has not been reported due to manuscript length.

R#3

Q9: Were other methods retrained on the same data?
A9: Yes.

Q10: LCD seems inferior to GRU.
A10: LCD is chosen for its near-maximum performance with fewer iterations. However, GRU is indeed an effective option as reported in our manuscript.
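
For readers unfamiliar with the metric behind the table in the rebuttal, the Dice score between a warped label map and a fixed label map can be sketched as follows. This is a generic illustration, not the authors' evaluation code: the flattened-list representation, the function names, the toy label maps, and the convention of returning 1.0 when a label is absent from both maps are assumptions.

```python
def dice(seg_a, seg_b, label):
    """Dice overlap 2|A n B| / (|A| + |B|) for one label in two flattened label maps."""
    inter = sum(1 for a, b in zip(seg_a, seg_b) if a == label and b == label)
    size_a = sum(1 for a in seg_a if a == label)
    size_b = sum(1 for b in seg_b if b == label)
    if size_a + size_b == 0:
        return 1.0  # assumed convention: a label missing from both maps counts as perfect
    return 2.0 * inter / (size_a + size_b)

def mean_dice(seg_a, seg_b, labels):
    """Average the per-label Dice over the given anatomical labels."""
    return sum(dice(seg_a, seg_b, lab) for lab in labels) / len(labels)

# Hypothetical flattened label maps (0 = background, 1 and 2 = structures).
warped_moving = [0, 1, 1, 2, 2, 2]
fixed = [0, 1, 2, 2, 2, 0]
score = mean_dice(warped_moving, fixed, labels=[1, 2])
```

The values in the table appear to be such averages expressed as percentages; the p-values would come from a separate significance test over test pairs, which the rebuttal does not name.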




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After the rebuttal, the reviewers agree on a weak accept, but emphasize that the promised fixes must be made before the Camera Ready. The paper will be interesting to discuss at the conference and I congratulate the authors. Please ensure that the required changes are made.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The technical contribution is deemed incremental by all reviewers, yet there are no major concerns with the paper. The improvements are rather small and I doubt that inter-subject brain registration is a good choice for promoting and benchmarking long-range correlation and iterative registration. The liver registration adds some insight but there are many multi-organ tasks that would be more suitable and enable a fair comparison with SOTA. Overall, the reviewers (myself included) mention that both applications are not really clinically relevant and also not as challenging as many others (see tasks in CuRIOUS, BratsReg or Learn2Reg apart from OASIS). There is a rather large amount of concurrent and similar MICCAI submissions this year that are all evaluated in part on the same brain datasets. Hence, only a small fraction of them can be accepted. One reviewer that rated at least two of those similar papers expressed their frustration about this and I share their view. So despite some merit, I recommend to reject the paper, since the community does not directly benefit from yet another paper that incrementally improves upon a very similar method from last year and actually falls a little short of similar papers submitted this year.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the rebuttal includes experimental details and new experiments, the reviewers’ critiques regarding limited novelty (i.e., hierarchy in Vision Transformer) and performance improvement over SOTA methods remain unaddressed. The merit of the proposed method in large deformation estimation is not clearly evaluated through the experiments.


