
Authors

Jiayi Zhu, Bart Bolsterlee, Brian V. Y. Chow, Yang Song, Erik Meijering

Abstract

Continual test-time adaptation (CTTA) aims to continuously adapt a source-trained model to a target domain with minimal performance loss while assuming no access to the source data. Typically, source models are trained with empirical risk minimization (ERM) and assumed to perform reasonably on the target domain to allow for further adaptation. However, ERM-trained models often fail to perform adequately on a severely drifted target domain, resulting in unsatisfactory adaptation results. To tackle this issue, we propose a generalizable CTTA framework. First, we incorporate domain-invariant shape modeling into the model and train it using domain-generalization (DG) techniques, promoting target-domain adaptability regardless of the severity of the domain shift. Then, an uncertainty and shape-aware mean teacher network performs adaptation with uncertainty-weighted pseudo-labels and shape information. Lastly, small portions of the model’s weights are stochastically reset to the initial domain-generalized state at each adaptation step, preventing the model from ‘diving too deep’ into any specific test samples. The proposed method demonstrates strong continual adaptability and outperforms its peers on three cross-domain segmentation tasks. Code is available online.
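The stochastic reset mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each weight is independently restored to its source (domain-generalized) value with a small probability `p` at every adaptation step. The function name, the flat list representation of the weights, and the value of `p` are illustrative assumptions, not details taken from the paper.

```python
import random


def stochastic_restore(current, source, p, rng):
    """Return a copy of `current` where each weight is replaced by the
    matching `source` weight with probability `p` (hypothetical helper)."""
    return [s if rng.random() < p else c for c, s in zip(current, source)]


rng = random.Random(0)
source = [0.0] * 1000    # stand-in for the domain-generalized weights
adapted = [1.0] * 1000   # stand-in for the weights after some adaptation steps

# Assumed reset probability of 1%; the paper's value is not given on this page.
restored = stochastic_restore(adapted, source, p=0.01, rng=rng)
n_reset = sum(1 for w in restored if w == 0.0)
print(f"{n_reset} of {len(restored)} weights were reset to the source value")
```

In a deep-learning framework the same idea would typically be applied element-wise to every parameter tensor with a random binary mask.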

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_63

SharedIt: https://rdcu.be/dnwBX

Link to the code repository

https://github.com/ThisGame42/CTTA

Link to the dataset(s)

https://chaos.grand-challenge.org/

https://www.synapse.org/#!Synapse:syn3193805/wiki/217789

https://github.com/liuquande/SAML


Reviews

Review #3

  • Please describe the contribution of the paper

    The paper addresses continual test-time adaptation for medical segmentation. The authors present a pipeline of multiple components to solve the problem: 1) shape-aware learning in the source domain through auxiliary prediction of SDFs and augmentation strategies from domain generalization, 2) an uncertainty-aware multi-task (binary segmentations and SDF) Mean Teacher for continual adaptation to the target domain, 3) stochastic weight reset to avoid catastrophic forgetting/bias towards latest test samples. Evaluation is performed under 3 domain shifts on public datasets.
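For readers unfamiliar with the auxiliary SDF target mentioned above, the following is a minimal sketch of a signed distance map computed from a binary mask. It uses a brute-force search on a tiny 2D grid and assumes the convention "negative inside the object, positive outside", measuring distance to the nearest pixel of the opposite class. The paper's exact definition, sign convention, and normalization may differ.

```python
import math


def signed_distance_map(mask):
    """Brute-force signed distance for a small 2D binary mask.

    Assumed convention: negative inside the object, positive outside.
    Distance is measured to the nearest pixel of the opposite class.
    """
    h, w = len(mask), len(mask[0])
    fg = [(i, j) for i in range(h) for j in range(w) if mask[i][j]]
    bg = [(i, j) for i in range(h) for j in range(w) if not mask[i][j]]
    if not fg or not bg:
        # No boundary exists, so the distance is undefined; return zeros.
        return [[0.0] * w for _ in range(h)]
    sdf = []
    for i in range(h):
        row = []
        for j in range(w):
            inside = bool(mask[i][j])
            others = bg if inside else fg
            d = min(math.hypot(i - a, j - b) for a, b in others)
            row.append(-d if inside else d)
        sdf.append(row)
    return sdf


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance_map(mask)
for row in sdf:
    print(" ".join(f"{v:6.2f}" for v in row))
```

For realistic image sizes a Euclidean distance transform from an image-processing library would replace the quadratic search used here.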

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Continual test-time adaptation is a problem with high clinical relevance that has rarely been studied in medical segmentation to date.
    • The components of the presented pipeline are well motivated. Overall, the paper is easy to follow and understand.
    • A comprehensive comparison with SOTA methods, demonstrating a strong performance of the presented method across three scenarios.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Each component of the presented pipeline is adapted from existing work: predicting SDFs as auxiliary task [22], the uncertainty-aware multi-task Mean Teacher [21], stochastic weight reset [3]. Thus, the technical novelty is quite limited.
    • The ablation experiment does not clearly reveal the most crucial components of the presented pipeline (see detailed comments).
    • The authors attribute the lower performance of CoTTA to the geometric augmentations. How would CoTTA perform with the same augmentations used in the authors’ pipeline? Demonstrating that the performance gain over CoTTA is not just due to the augmentations would better highlight the advantages of the proposed pipeline.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors used public datasets and state that they will release their code. Thus, results should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Introduction: in my view, the authors mix up terms and should more clearly differentiate between source-free domain adaptation (adaptation on an unlabeled target training set before inference on the target test set) and test-time adaptation (the model is directly optimized on a single test image or batch).
    • Ablation experiment: All ablated versions are only slightly worse than the full proposed method, and in most cases still superior to the SOTA competitors. As such, it remains unclear which component of the method is decisive for the improvement over a standard Mean Teacher. What would be the performance of this baseline? And how do the individual components (SDF only, uncertainty only, stochastic restore only) improve the standard Mean Teacher?
    • Evaluation: “We then compared the final performance of each model against their running performance to evaluate their ability for continual adaptation.” –> Why does this indicate the ability for continual adaptation? I agree with the authors that an improved performance of the final model on earlier data is desirable, but the practical use for CTTA is unclear to me because, by task definition, we are exclusively interested in the online performance.
    • Fig. 2 could better visualize the decoupled 2-stage training on source and target data. In its current form, it kind of looks like joint source-target-training.

    Typos and minor details:

    • [9] is not a source-free method, and why is the flexibility of [9,10] limited?
    • “allows the source model to perform reasonably in any target domain regardless of the severity of the domain shift.” –> This is a very strong claim, which is difficult to verify in experiments. I recommend that the authors use a weaker formulation.
    • p.2: model weight –> model weights
    • What is the momentum of the teacher model?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    On the one hand, continual test-time adaptation is a barely explored task in medical segmentation, and the presented method achieves a strong performance in a comprehensive evaluation. On the other hand, technical novelty is limited and some parts of the experiments (ablation study and implementation of CoTTA) could be improved. (If available, I would vote for a borderline rating.)

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper combines three strategies for robust domain adaptation. First a robust model is trained with domain generalisation on a source domain to predict segmentation and distance maps for each label. Second, this model is then continually adapted on unseen target domain without supervision by using a student-teacher architecture with an uncertainty-weighted loss. Finally, randomly selected weights of the adapted model are reset at each minibatch to their corresponding values in the source model. This last step avoids biasing the adapted model too much towards the pseudo-supervision obtained on the target domain. The method is then tested on both intra- and inter-modality settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Sound and well-motivated method
    • Comprehensive evaluation and ablation studies
    • Numerous and relevant baselines
    • Well-written paper

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weakness, but:

    • some clarifications are needed in the methods
    • the paper should clearly acknowledge (a) the existing methods it is building on, and (b) other alternatives to domain adaptation, such as domain randomization strategies.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility: the methods are clear enough to be re-implemented (code will be provided anyway upon acceptance) and datasets are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Some clarifications are needed in the methods:

    • How is the SDF head implemented? Is it a separate decoder, or does it just use a different last layer?
    • g remains obscure: its purpose and training strategy need to be explained.
    • How are the Gaussian corruptions sampled for the inputs of the teacher model? Is the standard deviation always the same or does it vary across mini-batches and/or the K mini-batch examples?
    • Is Y_t renormalized after multiplying the W_tk and Y_tk? Maybe this is because only hard segmentations are used for the loss? But hard Dice loss needs to be motivated, since people usually use soft Dice loss [26].
    • It looks like l_seg and l_seg^{con} are redundant, since in l_seg the teacher segmentation is already weighted for uncertainty.
    • The first term in l_s seems very handcrafted; I’m not sure about the generalizability of this term.
    • Do you change the learning rate during adaptation? A learning rate of 1e-3 seems very large for just fine-tuning.
    • It is unclear which part of the second module is removed for the “uncertainty” ablation. Is it the uncertainty weighting, the whole student/teacher architecture, etc.?
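The questions above about noise sampling and renormalization can be made concrete with a small sketch. It draws K Gaussian-corrupted copies of an input with a fixed standard deviation, weights each teacher prediction by a confidence score, and renormalizes by the summed weights so that the combined value remains a valid probability. The toy teacher, the entropy-based weight, and the values of `k` and `sigma` are all assumptions made for illustration; they are not the W_tk and Y_tk definitions of the paper.

```python
import math
import random


def toy_teacher(x):
    """Stand-in for the teacher network: a per-pixel sigmoid (hypothetical)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def weighted_pseudo_label(x, k, sigma, rng):
    """Combine K noisy teacher passes, weighting each pass by its confidence."""
    num = [0.0] * len(x)
    den = [0.0] * len(x)
    for _ in range(k):
        # Same standard deviation for every pass (one possible sampling choice).
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        for i, p in enumerate(toy_teacher(noisy)):
            # Assumed weighting: low-entropy (confident) passes count more.
            w = max(0.0, 1.0 - binary_entropy(p))
            num[i] += w * p
            den[i] += w
    # Renormalize by the summed weights; fall back to 0.5 if all weights are 0.
    return [n / d if d > 0.0 else 0.5 for n, d in zip(num, den)]


rng = random.Random(0)
x = [-3.0, -0.1, 0.1, 3.0]   # toy logits for four pixels
soft = weighted_pseudo_label(x, k=8, sigma=0.1, rng=rng)
hard = [1 if p >= 0.5 else 0 for p in soft]   # hard labels, as used by a hard Dice loss
print(soft, hard)
```

Without the division by the summed weights, the combined value would shrink with the weights and would no longer be comparable to a probability, which is the concern behind the renormalization question.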

    I think the paper is very interesting, but it should openly acknowledge in the contributions the existing methods it is building on. The first module is basically [19], except that it adds distance maps proposed in [22] and many other papers. The second module builds on [21] and adds a novel uncertainty weighting mechanism. And finally, the weight restoration is entirely taken from [3].

    The introduction is well written, but some important methods are missing:

    • The authors should acknowledge recent work on domain randomization such as SynthSeg (Billot et al., SynthSeg: Segmentation of brain MRI scans of any contrast and resolution without retraining, 2021), which was successfully used for cross-site, cross-modality, and cross-resolution brain and cardiac segmentation with no retraining/fine-tuning.
    • Shape encoding in TTA is not new, and the authors cite Karani et al. (Test-time adaptable neural networks for robust medical image segmentation, 2021), where a shape denoiser is used to correct test-time segmentations in order to get surrogate ground truth.

    Also in the introduction, unsupervised domain adaptation methods are not “earlier works”, but are an active ongoing area of research. For proof, the cited papers [6-8] are as recent as other papers cited for TTA [9-10].

    I’m a bit skeptical about the performance of CiDG in cross-modality scenarios, since supervised networks are typically shown to completely collapse under large domain shifts, such as CT-T2 adaptation, even with aggressive data augmentation.

    Finally, the font size in Table 1 is excessively small. I suggest removing one or two columns from the prostate experiment for better visualization.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this is an interesting paper, which shows that TTA can finally be deployed for cross-modality scenarios. I’m happy to improve my rating if the authors clarify some aspects of the methods, and properly acknowledge related existing methods.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a continual test-time adaptation framework that adapts a source-trained model to a target domain. The framework has three main components:

    • A shape-aware model training component that employs augmentation techniques from reference [19]
    • A teacher/student component that adapts the model weights to the target domain
    • A component that resets a subset of the network parameters to their initial values to avoid “catastrophic forgetting”

    Empirical results that demonstrate the performance improvements compared to state-of-the-art are provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Solid empirical validation and demonstrated improvements compared to state of the art methods
    • Thorough literature review, well structured paper
    • Analysis provides experimental insights related to domain adaptation in medical image analysis
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • More details should be provided regarding some parts of the framework (for example, the description of the student/teacher component).
    • There are more elaborate ways to estimate uncertainty of Deep Learning models beyond performing K passes using additive Gaussian noise.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The framework appears to be reproducible using the information provided in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The proposed framework and the empirical validation provide useful insights regarding domain adaptation in medical image analysis and the components that are needed in order to achieve good performance.

    The authors should provide more details about the student/teacher framework. More precisely, the way that the model weights of the “mean teacher” network are updated is not fully clear. The “mean teacher” is also not clearly illustrated in Figure 2.
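As background for this comment, a standard mean teacher updates the teacher weights as an exponential moving average (EMA) of the student weights. The sketch below shows that generic rule only. Whether the paper uses exactly this update, and with which momentum, is not stated on this page, and the momentum of 0.5 is chosen purely to make the numbers easy to follow (values close to 1 are common in practice).

```python
def ema_update(teacher, student, momentum):
    """Move each teacher weight towards the matching student weight.

    teacher_new = momentum * teacher + (1 - momentum) * student
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [0.0, 1.0]
student = [1.0, 1.0]   # kept fixed here; in practice it changes after every gradient step

for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
    print(step, teacher)
```

The teacher receives no gradient updates of its own in this scheme; it only tracks a smoothed history of the student.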

    The authors should explore the potential of more elaborate methods to estimate uncertainty for their base model predictions. It would be interesting to see whether more elaborate methods improve performance for domain adaptation.

    In the experiment section, the authors should further clarify this sentence: “The model was empirically updated for two steps per test batch for prostate and muscle segmentation and 10 steps for abdominal segmentation.”
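One plausible reading of the quoted sentence is that several gradient steps are taken on each incoming test batch before moving to the next one. The toy sketch below illustrates that reading with a single scalar parameter and a squared-error objective. The objective, learning rate, and data are hypothetical stand-ins and do not reflect the paper's losses or settings.

```python
def adapt_online(batches, theta, steps_per_batch, lr):
    """Toy online adaptation: `theta` follows each batch mean via gradient steps.

    Returns the final parameter and the error recorded right after adapting
    on each batch (a stand-in for "running performance").
    """
    running_error = []
    for batch in batches:
        target = sum(batch) / len(batch)
        for _ in range(steps_per_batch):
            grad = 2.0 * (theta - target)   # gradient of (theta - target) ** 2
            theta -= lr * grad
        running_error.append(abs(theta - target))
    return theta, running_error


batches = [[1.0, 1.0], [2.0, 2.0]]
theta_2, err_2 = adapt_online(batches, theta=0.0, steps_per_batch=2, lr=0.25)
theta_10, err_10 = adapt_online(batches, theta=0.0, steps_per_batch=10, lr=0.25)
print(theta_2, err_2)
print(theta_10, err_10)
```

More steps per batch fit the current batch more closely in this toy setting. In a real model they also raise the risk of over-adapting to individual test samples, which is the risk the stochastic weight reset is meant to limit.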

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Solid empirical validation and demonstrated improvements compared to state of the art methods
    • Analysis provides experimental insights related to domain adaptation in medical image analysis
  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a framework for continual test-time adaptation for medical image segmentation, which combines three strategies for robust domain adaptation. The proposed method is well-motivated and is supported by a comprehensive evaluation and ablation studies against numerous and relevant baselines. The strengths of the paper include the empirical validation, thorough literature review, and well-structured paper, while the weaknesses include the need for more clarifications in the methods, proper acknowledgment of related existing methods, and improvements in the ablation study and implementation of CoTTA. Reviewers 1 and 2 recommend acceptance, with some suggestions for improvements, while Reviewer 3 rates the paper as a weak accept, citing limited technical novelty and the need for some improvements in the experiments. Overall, taking into account the positive assessments and recommendations of all three reviewers, the paper is recommended for provisional acceptance.




Author Feedback

We would like to first thank all reviewers and meta-reviewer for their kind and constructive feedback. In the camera-ready version of our paper, we will do our best to ensure that (1) more clarifications are provided for each component of our method, (2) the existence of alternative methods to domain adaptation is adequately mentioned, and (3) methods that our framework is built on are more clearly acknowledged. As the space is quite limited for MICCAI papers, we will improve and expand our introduction, literature review, experiments, and ablation study sections in the future journal version of our manuscript to incorporate all the feedback we have received.


