
Authors

Alexander Bigalke, Lasse Hansen, Tony C. W. Mok, Mattias P. Heinrich

Abstract

State-of-the-art deep learning-based registration methods employ three different learning strategies: supervised learning, which requires costly manual annotations, unsupervised learning, which heavily relies on hand-crafted similarity metrics designed by domain experts, or learning from synthetic data, which introduces a domain shift. To overcome the limitations of these strategies, we propose a novel self-supervised learning paradigm for unsupervised registration, relying on self-training. Our idea is based on two key insights. Feature-based differentiable optimizers 1) perform reasonable registration even from random features and 2) stabilize the training of the preceding feature extraction network on noisy labels. Consequently, we propose cyclical self-training, where pseudo labels are initialized as the displacement fields inferred from random features and cyclically updated based on more and more expressive features from the learning feature extractor, yielding a self-reinforcement effect. We evaluate the method for abdomen and lung registration, consistently surpassing metric-based supervision and outperforming diverse state-of-the-art competitors. Source code is available at https://github.com/multimodallearning/reg-cyclical-self-train.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_64

SharedIt: https://rdcu.be/dnwxh

Link to the code repository

https://github.com/multimodallearning/reg-cyclical-self-train

Link to the dataset(s)

https://learn2reg.grand-challenge.org/Datasets

https://med.emory.edu/departments/radiation-oncology/research-laboratories/deformable-image-registration/downloads-and-reference-data/copdgene.html


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a novel DNN-based strategy to register 3D medical images. The method is unsupervised and therefore does not require costly annotations. It is based on cyclical self-training, where the following two steps are alternated: (1) labels (here, the deformation) are computed based on the current model and are randomly drawn at initialization; (2) the model is trained on the current labels. As clearly stated in the introduction, the principle was successfully used in image registration in [3] and [25]. The authors propose an alternative formulation following the principles described at the end of Section 2.2. Results obtained on the Learn2Reg dataset and the DIR-Lab COPDGene dataset show the efficiency of the approach compared with standard and state-of-the-art registration algorithms such as ANTs, SAME, and VoxelMorph, among others, and with their own registration strategy trained with MIND and NCC similarity metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The methodology is of high interest for the MICCAI community and is very well motivated.
    • The novelty of the methodological contribution, compared with existing literature, is well identified.
    • The experimental protocol and results are convincing in my opinion.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The methodology section is hard to follow.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The data are public, but the code does not appear to be distributed anywhere.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    As mentioned above, my main concerns with this paper deal with the clarity of the methodology:

    • A more formal description of the contributions, in particular in the 2nd paragraph of Section 2.2, would help to clarify them; their current description is broad.
    • An algorithm explaining how the different parts of the methodology interact would emphasize the key contributions. Alternatively, these key contributions could be highlighted in Fig. 1.
    • I don’t see the link between the ‘Keypoint registration’ paragraph and the rest of Section 2.

    Although the quantitative and qualitative results are convincing, there is no discussion explaining why the authors believe that their contributions helped to obtain these good results. Perhaps less discussion could be given in Section 1 to leave more space for a stronger discussion at the end of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology appears promising to me, but its description is not clear enough to fully understand what was done by the authors.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper addresses the problem of feature-based registration with deep learning. The method is evaluated on the registration of abdominal and lung structures from CT. Data was obtained from the Learn2Reg challenge. The method is founded on previous work, namely [20], and improves that method by combining the optimization with a CNN-based feature extractor. The authors empirically observed that this combination improves the initial optimization and regularizes the error of correspondences computed from sub-optimal models. These empirical observations led to the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The application is interesting and relevant to Miccai community.

    • The results show improvement with respect to the state of the art.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method is heavily based on previous work.

    • The proposed improvements are limited.

    • The method suffers from a superficial explanation. Although the ideas are explained, the reader does not get a clear idea of the different stages of the method. The code lacks suitable modularity and is hard to understand.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The datasets used for evaluation are publicly available.

    The authors provided the code and models in an anonymous GitHub repo. However, all functionalities are gathered into a single file of 674 lines of code. A huge training/testing method has been written, and it is impossible to figure out the flow of the algorithm and connect the lines of code with the manuscript. I do not feel able to exactly reproduce the training phase, to start with.

    For the evaluation, the authors provided a clear description of metrics and tendency. Statistical significance was stated when needed.

    The average runtime of the testing phase was provided. However, it is important to know the runtime of the training phase and the memory footprint. These figures are not provided.

    The clinical significance of the method can be inferred from the introduction. However, the proposed method needs further validation on more diverse datasets before moving towards clinical application.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The addressed problem is relevant to the MICCAI community. The selected approach reminds me of DeepNet, an optical flow method which combined feature selection with the Large Displacement Optical Flow method. In this case, I believe that the proposed method is restricted to the registration of the features (point-based registration). I believe that this is the right way to go for abdominal images.

    The manuscript is hard to follow since the optimization stage is not presented, and it is hard to see how the feature extraction network is combined with the optimizer. A more self-contained manuscript, or adding this information to the supplementary material, would be a plus.

    I believe that, according to MICCAI scoring, this is a fair paper whose weaknesses slightly outweigh its merits. My impression is that the proposed improvements are the result of empirical observations, and some of the justifications are hard to understand. In the following, the authors can find various remarks that may help improve the quality of the manuscript.

    Abstract and Introduction. When the authors mention supervised learning, they focus only on point-based approaches. However, I believe that the relevant family of methods that learn in a supervised fashion from ground truth estimated by traditional methods (e.g. QuickSilver or FlashNet) should also be mentioned.

    Abstract. What is a feature-based differentiable optimizer? Citing a prototypical method of this kind would help the reader understand. Introduction. The authors mention NCC and MIND as popular image-based metrics in image registration. However, to my knowledge, MIND has not been considered standard in typical image registration methods (e.g. Jan Modersitzki’s book). I would recommend limiting the list to SSD, NCC, and MI.

    Introduction. “… the performance of the trained deep learning models is inferior to classical optimization-based counterpart.” May the authors provide an appropriate citation for this fact? As far as I know, ANTs performance has been surpassed in terms of DSC accuracy by methods such as SyMNet, LapIRN, or TransMorph. What happens with transformation quality is another story…

    Introduction. Contributions. I have an existential problem with the claim that the improvement shown by the orange line in Fig. 2 is due to an inductive bias. According to the figure, it is correct that adding the network to the optimizer slightly improves the outcome. However, the reason may be that a random initialization works better for the optimizer, or some other unknown reason beyond an inductive bias, why not? By the way, what happens to the outcome when the authors use the initialization given by the blue line (is it possible?)

    Methods. I found it hard to understand which class the method belongs to. It is a point-based registration method, right?

    Methods. In Equation (2), the authors use the TRE metric as a loss function. Is there any problem with the fact that TRE is also used as an evaluation metric in Table 3?
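
    For concreteness, TRE is typically the mean Euclidean distance (in mm) between corresponding keypoints after registration; the following is a minimal, purely illustrative sketch (helper name and shapes are my own assumptions, not the paper's Equation (2)):

        import torch

        def tre(pred_kpts, gt_kpts, spacing):
            # Target registration error: mean Euclidean distance (in mm) between
            # predicted and ground-truth keypoints given in voxel coordinates
            return ((pred_kpts - gt_kpts) * spacing).pow(2).sum(-1).sqrt().mean()

        # Example: 300 keypoints in voxel coordinates with anisotropic spacing in mm
        pred = torch.rand(300, 3) * 100
        gt = pred + torch.randn(300, 3)
        print(tre(pred, gt, torch.tensor([1.75, 1.25, 1.75])))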

    Methods. As I said, I found it hard to clearly understand the different stages and order of the method. The code is not easily understandable.

    Experiments. The authors included the SdlogJ metric for assessing the “quality” of the resulting transformation. This metric was proposed as a standard in the Learn2Reg framework. However, the official Learn2Reg code clamps negative Jacobians. The metric measures the amount of deformation of the transformations, and the assumption is that methods with lower SdlogJ are better. In my humble opinion, there is no relationship between this metric and the overall quality of the transformation; I could spend pages explaining my reasons. I believe it would be more informative to report the maximum and minimum achieved Jacobian determinant and the percentage of negative Jacobians.
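
    For reference, the statistics suggested above can be computed directly from a dense displacement field; the following is a minimal NumPy sketch using finite differences, assuming a field of shape (3, D, H, W) given in voxel units (function name and shapes are illustrative assumptions, not the authors' code):

        import numpy as np

        def jacobian_det_stats(disp):
            # disp: displacement field of shape (3, D, H, W) in voxel units.
            # Returns (min det, max det, fraction of voxels with non-positive det).
            grads = [np.gradient(disp[i], axis=(0, 1, 2)) for i in range(3)]
            # Jacobian of the deformation phi(x) = x + disp(x): J = I + d(disp)/dx
            J = np.empty(disp.shape[1:] + (3, 3))
            for i in range(3):
                for j in range(3):
                    J[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)
            det = np.linalg.det(J)
            return det.min(), det.max(), float((det <= 0).mean())

        # Example with a random (non-smooth) field, for illustration only
        disp = 0.1 * np.random.randn(3, 32, 32, 32)
        print(jacobian_det_stats(disp))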

    Experiments. Since the datasets come from Learn2Reg, is it necessary to reproduce the processing applied to the data as part of the Learn2Reg challenge framework? Did the authors perform any further processing required by the proposed method to improve the results?

    Experiments. Results. I did not understand the meaning of the curves in Fig. 2. They seem similar to a ROC curve, so I interpreted the figure that way. Could the authors describe how the curve is computed, for better understanding?

    Experiments. Results. The authors show results collected from [9] and [27]. Can they be compared fairly? If not, it should be stated. It is quite common to find unfair comparisons in the state of the art; I believe we should break this tendency by at least stating to what extent the numbers can be compared.

    Experiments. Table 2. Why is the SdlogJ metric not available for ANTs, DEEDS, and SAME? Revisiting my previous question, is it fair to include ANTs in the comparison?

    Experiments. May the results be compared with the metrics given in Learn2Reg challenge?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As I said, I believe that, according to MICCAI scoring, this is a fair paper whose weaknesses slightly outweigh its merits. My impression is that the method is heavily based on previous work. In addition, the flow of the method is hard to understand. I have serious concerns with some of the justifications given for the final design.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I first apologize for not being able to identify the class of the method. The term “feature-based” has traditionally been identified with “point-based”, but the term “feature” may now also refer to image representations produced by CNNs. I have revised my review and, now that the kind of method is identified, I still believe that all my comments are valid. So I would appreciate any actions taken by the authors to improve the manuscript following my suggestions.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel unsupervised neural network training paradigm for deformable image registration based on the teacher-student model. During the training process, the teacher model generates pseudo displacement fields, which the student model optimizes as the target. The weights for the teacher model are periodically updated by copying from the student model. Overall, this approach represents a potentially valuable contribution to the field, as it provides an alternative to existing unsupervised registration methods and has the potential to improve registration accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a novel self-training approach for unsupervised registration, which builds on the previously introduced teacher-student model with several modifications to adapt it to medical image registration. One notable aspect of the proposed method is its ability to be trained without labeled data, making it a promising approach for unsupervised registration tasks. While several training tricks are used, such as weighted sampling, pseudo label refinement, and learning rate warm restarts, the authors have included ablation studies to address some of the potential concerns about the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. While the paper offers a valuable contribution to the field, its major weakness lies in the experimental methodology. Although there appears to be a noticeable improvement in terms of the Dice coefficient, as reported in Tables 1 and 2, the lack of standard deviations makes it difficult to evaluate the statistical significance of the results, which undermines the credibility of the experimental findings.
    2. One major contribution of the paper is its adoption of the teacher-student model without using any labeled data for initialization. The proposed approach is based on the assumption that convolutional networks can produce reasonable output even without training. However, this raises the question of whether the proposed self-training paradigm is limited to convolutional neural networks.
    3. The main experiment of the paper is conducted using data from the L2R challenge. However, the authors did not evaluate their proposed method on the challenge test data, which is the standard practice for benchmarking registration methods. Instead, they split the original training dataset into training and testing sets. This goes against the purpose of using a public dataset, which is to enable direct comparison between published algorithms. It is worth noting that the performance of comparison methods in the paper appears to differ from the numbers reported in the challenge. For instance, LapIRN achieved a DSC of 0.67 on the testing set, but only 0.42 in the paper.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have released the source code of this work. The results presented in the paper can be reproduced by others.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The rationale behind the use of stop-gradient, random augmentation, and pseudo label finetuning is presented in a confusing way in Section 2.2. While these concepts are somewhat related to contrastive learning, no contrastive loss is used in the proposed method. Instead, the proposed approach appears more closely related to the noisy student model (citation [24] in the manuscript). Specifically, the idea presented in the noisy student model, that pseudo labels should be as accurate as possible and that noise should be added to the student model (including through data augmentation), is very similar to the proposed method.

    Figure 2 is difficult to understand. The vertical axis is labeled as ‘cumulative distribution,’ but the graph appears to be decreasing rather than increasing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel unsupervised training paradigm for deep learning based deformable image registration based on the teacher-student model. However, the experimental evaluation raises concerns about the validity of results.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposed an unsupervised registration training paradigm based on cyclical self-training. Specifically, this paradigm consists of a feature extraction network and a differentiable optimizer.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The idea is interesting and the results in Tables 2 and 3 seem promising; (2) the paper is well-structured; (3) the paper compares against a sufficient number of baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    See section 9 for improvements.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is feasible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) A significance test would help to interpret the results in Tables 2 and 3; (2) regarding “Supp., Fig. 2, demonstrating accurate and smooth displacements”, it is not clear from the supplementary material that the resulting transformation is smooth; (3) some shapes in Fig. 3 are strange; it might be better to show 3D renderings directly for better visibility.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Based on the novelty, the writing of the paper, and the promising results, this paper is okay for acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The quality of this paper is marginally above the bar for MICCAI, so the reviewer suggests weak accept.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper has received a mixed review with both positive and negative feedback. During the rebuttal phase, the authors are encouraged to thoroughly address all reviewers’ questions, providing additional details, and correcting any potential misinformation.

    • In particular, the authors should focus on clarifying the methodology design to alleviate any confusion raised by the reviewers. This involves explaining the connection between the proposed network components, providing insights into why the proposed network yields improved results, and addressing specific queries related to the network design.

    • In addition, the authors are advised to address concerns regarding the experimental evaluation, such as ensuring consistency between the dataset used in the experiment and the L2R challenge.

    • It is essential for the authors to clearly articulate the novelty and significance of their proposed method in comparison to previous works cited in the review, specifically referencing papers [3] and [20].

    • The authors also need to address carefully the differences between this paper and paper #1168, “A denoised Mean Teacher for domain adaptive point cloud registration”.




Author Feedback

We thank R1-R4 for their thoughtful comments and appreciating our “valuable”(R3)/“interesting”(R1,R4) method, which is “relevant/of high interest to Miccai community”(R1,R2) and shows “convincing results”(R1)/“improvements over SOTA”(R2) compared to “sufficient baseline methods”(R4).

AC,R2: Novelty/Significance, comparison to [3,20] 1) We propose a - to our knowledge - new cyclical self-training paradigm for unsupervised registration. 2) As ground truth is absent and pseudo labels (PLs) noisy, we combine DL-based feature extraction with differentiable optimization to refine PLs iteratively and cyclically, avoiding pitfalls of existing unsupervised methods. 3) We flexibly solve image- and point-based registration, achieving SOTA results on diverse tasks. By contrast, [3,20] train with manual labels, only apply to one modality, and neither include cyclical training nor PL refinement.

AC,R2,R3,R4: Fairness & significance of the abdomen experiment Our data split (train-val split of L2R) is identical to all SOTA methods for a fair comparison and uses the same pre-processed L2R data. A Wilcoxon signed-rank test confirms significant improvements (p<0.001) over all competitors (Tab 2) with public code (SAME is N/A). As our scope is unsupervised registration, numerical results are not directly comparable to most methods (incl LapIRN) of the L2R paper that improve performance with segmentation labels.
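
As an illustration of the test we report (placeholder scores, not our actual per-case values), paired per-case Dice scores of two methods can be compared with a Wilcoxon signed-rank test via scipy.stats.wilcoxon:

    import numpy as np
    from scipy.stats import wilcoxon

    # Placeholder per-case Dice scores for two methods (illustrative values only)
    dice_ours = np.array([0.72, 0.68, 0.75, 0.70, 0.69, 0.74, 0.71, 0.73])
    dice_base = np.array([0.65, 0.66, 0.70, 0.64, 0.67, 0.69, 0.66, 0.68])

    # Two-sided Wilcoxon signed-rank test on the paired differences
    stat, p = wilcoxon(dice_ours, dice_base)
    print(f"W={stat:.1f}, p={p:.4f}")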

AC,R1: Why method yields improved results? Metric supervision (NCC,MIND) relies on shallow features or image intensities and is prone to noise and local minima. By contrast, our supervision through optimization-refined and -regularized PLs promotes learning task-specific features that are more robust to noise. Cyclical learning gradually improves the expressiveness of features and avoids local minima.

Method’s class (R2), keypoint registration (R1), limitation of the method to CNNs (R3): Our method is not point-based (R2) but agnostic to input modality, feature extractor g, and optimizer h. §1,2 of Sec 2.3 describe h and g for image inputs, our primary focus. Meanwhile, our method can also solve sparse keypoint registration by implementing g as a graph net and h as loopy belief propagation (§3, Sec 2.3). Referring to R3, SOTA results on the lung task (Tab 3) show our approach is not limited to CNNs but also excels with a graph net.

AC,R1,R2,R3: Details on method design Given fixed & moving features, the optimizer infers a disp field that minimizes a combined objective of smoothness and feature dissimilarity (details in [20]). This is the basic forward pass in the learning stream. To regularize and finetune PLs and thus improve supervision, we include 3 refinements. 1) Forward-backward consistency: adds computation of the reverse disp field (F to M) and then iteratively minimizes the discrepancy between fields. 2) Additional warp: warps the moving image with the inferred disp field and then repeats the previous steps. 3) Instance optimization: finetunes the final disp field with Adam, jointly minimizing a regularization and feature dissimilarity. We’ll include a formal description of these steps in Sec 2.2/3 and emphasize the contributions in Fig 1 (R1). We updated our repo with modularized code (R2).
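
To make refinement 3 concrete, a minimal sketch of instance optimization with Adam is given below; shapes, the MSE feature dissimilarity, the diffusion regularizer, and all hyperparameters are illustrative assumptions here and differ from the exact implementation in our repository:

    import torch
    import torch.nn.functional as F

    def instance_optimize(feat_fix, feat_mov, disp_init, iters=50, lr=0.01, lam=0.1):
        # feat_fix, feat_mov: feature volumes of shape (1, C, D, H, W)
        # disp_init: initial displacement field (1, 3, D, H, W) in normalized [-1, 1] coords
        disp = disp_init.clone().requires_grad_(True)
        optim = torch.optim.Adam([disp], lr=lr)
        # Identity sampling grid in normalized coordinates, shape (1, D, H, W, 3)
        grid = F.affine_grid(torch.eye(3, 4).unsqueeze(0), feat_fix.shape, align_corners=False)
        for _ in range(iters):
            optim.zero_grad()
            # Warp moving features with the current displacement field
            warped = F.grid_sample(feat_mov, grid + disp.permute(0, 2, 3, 4, 1),
                                   align_corners=False)
            dissim = F.mse_loss(warped, feat_fix)
            # Diffusion regularizer: squared finite differences of the displacement
            reg = ((disp[..., 1:, :, :] - disp[..., :-1, :, :]) ** 2).mean() \
                + ((disp[..., :, 1:, :] - disp[..., :, :-1, :]) ** 2).mean() \
                + ((disp[..., :, :, 1:] - disp[..., :, :, :-1]) ** 2).mean()
            (dissim + lam * reg).backward()
            optim.step()
        return disp.detach()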

R2,R3: Meaning of Fig 2 Yes, the y-axis does not show the cumulative distribution P(X<x) but the complementary P(X>x), as X represents the DSC (higher is better). We’ll correct the label and caption.
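
As an illustration of this reading, the curve is the empirical survival function of the DSC values; with placeholder values it can be computed as:

    import numpy as np

    # Placeholder DSC values (illustrative only)
    dsc = np.array([0.31, 0.45, 0.52, 0.58, 0.63, 0.66, 0.70, 0.74, 0.78, 0.82])
    thresholds = np.linspace(0.0, 1.0, 101)
    # Fraction of values exceeding each threshold, i.e. P(DSC > x)
    survival = (dsc[None, :] > thresholds[:, None]).mean(axis=1)
    print(survival[::25])  # values at x = 0.0, 0.25, 0.5, 0.75, 1.0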

R2: Runtime & memory Training only requires 8 GB and 90 min on an RTX 2080.

AC: Subm #1168 vs #1171 The submissions significantly differ in the addressed problem (domain adaptive point cloud registration vs unsupervised registration focusing on images), datasets (lung vessel trees from PVT vs abdomen CT and Förstner keypoints from COPD) and proposed method. Unlike cyclical self-training with feature network+optimizer & PL refinement, #1168 mitigates noisy PLs in the Mean Teacher paradigm by a new filtering strategy and synthesizing training pairs with known displacements.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After careful consideration of the authors’ rebuttal, all reviewers consistently recommend a weak acceptance of this paper. The authors have adequately addressed the major concerns and questions raised by the reviewers regarding the significance of the proposed methods. Additionally, they have provided clarification on the method and experimental design, enhancing the overall understanding of the work. The authors are strongly encouraged to incorporate all reviewers’ questions and suggestions into a revised version of the paper for its final publication.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The methodology is not clearly written. It seems that the method was motivated by point registration, and thus there are confusing explanations, although this paper is about image registration. Many detailed comments were provided by reviewers, but the rebuttal did not clarify them enough. The proposed method is conceptually similar to the early accepted paper 1168 by the authors, although those papers include different datasets.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I have read the comments and rebuttal. This paper is about unsupervised image registration via cyclical self-training (self-supervised learning method). Most of the concerns raised by the reviewers have been addressed. The authors are suggested to take their comments into consideration if the paper is accepted.


