Authors

Mathilde Bateson, Herve Lombaert, Ismail Ben Ayed

Abstract

Supervised learning is well known to fail at generalization under distribution shifts. In typical clinical settings, the source data is inaccessible and the target distribution is represented with a handful of samples: adaptation can only happen at test time on a few (or even a single) subject(s). We investigate test-time single-subject adaptation for segmentation, and propose a Shape-guided Entropy Minimization objective for tackling this task. During inference for a single testing subject, our loss is minimized with respect to the batch normalization’s scale and bias parameters. We show the potential of integrating various shape priors to guide adaptation to plausible solutions, and validate our method in two challenging scenarios: MRI-to-CT adaptation of cardiac segmentation and cross-site adaptation of prostate segmentation. Our approach exhibits substantially better performance than the existing test-time adaptation methods. Even more surprisingly, it fares better than state-of-the-art domain adaptation methods, although it forgoes training on additional target data during adaptation. Our results question the usefulness of training on target data in segmentation adaptation, and point to the substantial effect of shape priors on test-time inference. Our framework can be readily used for integrating various priors and for adapting any segmentation network, and our code is available anonymously.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_70

SharedIt: https://rdcu.be/cVRwX

Link to the code repository

https://github.com/mathilde-b/TTA

Link to the dataset(s)

http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs/data.html

https://liuquande.github.io/SAML/

Reviews

Review #1

Please describe the contribution of the paper

The authors propose a method to adapt a segmentation DNN using just the test data during inference. The adaptation uses a single subject. A UNet-based DNN is first trained on the source domain using standard cross-entropy loss. At test time, only the network’s batchnorm parameters are adapted on the target subject, using a combination of Shannon entropy and shape descriptor losses (Shannon entropy for high-confidence predictions, KL divergence between class ratios, soft penalties on centroid and distance-to-centroid descriptors). The method is applied to two different scenarios: MRI-to-CT adaptation for cardiac segmentation (MMWHS data, 20 each), and cross-site adaptation for MRI prostate segmentation (NCI-ISBI Challenge T2-weighted MRI, 30 each). The metrics used in the results are 3D DSC and ASD. The method is compared against TTA, DA and SFDA methods. The proposed method performs better than the other methods on the DSC metric in both the cardiac segmentation and prostate segmentation scenarios.
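To make the loss described above concrete, here is a minimal sketch of an entropy term combined with a KL term on class ratios. It only evaluates the loss value on a list of per-pixel class distributions. It does not perform the gradient update of the batchnorm parameters, and it omits the centroid penalties. The function names, the KL direction, and the trade-off weight `lam` are assumptions made for illustration, not taken from the paper or its repository.

```python
import math

EPS = 1e-10  # numerical floor to avoid log(0)

def mean_entropy(probs):
    """Average Shannon entropy over pixels; probs is a list of per-pixel class distributions."""
    return -sum(pk * math.log(pk + EPS) for p in probs for pk in p) / len(probs)

def class_ratios(probs):
    """Soft class ratios: mean predicted probability of each class over all pixels."""
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]

def kl_divergence(prior, predicted):
    """KL(prior || predicted) between two class-ratio vectors (direction is an assumption)."""
    return sum(q * math.log((q + EPS) / (r + EPS)) for q, r in zip(prior, predicted))

def adaptation_loss(probs, prior_ratios, lam=1.0):
    """Entropy term plus class-ratio prior term; lam is a hypothetical trade-off weight."""
    return mean_entropy(probs) + lam * kl_divergence(prior_ratios, class_ratios(probs))
```

Confident predictions whose class ratios match the prior give a loss near zero, while uncertain predictions or mismatched ratios increase it. In the method as the review describes it, a loss of this kind would be minimized with respect to the batchnorm scale and bias parameters only.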
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well written and easy to understand. The authors use publicly available datasets to validate their approach. The validation benchmark includes the entire range of methods : From NoAdap (lower bound) to Oracle (Upper bound). The proposed approach has been compared against methods that include TTA, DA and SFDA.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

- Minor spelling mistake in Section 3.1: "Adaption" instead of "Adaptation".
- The Shannon entropy expression on page 4: did the authors mean to put the class weights inside the summation over ‘k’?
- Compared to refs [2] and [20], the novelty is limited to inference on the test subject and the use of soft constraints on the centroid and distance-to-centroid shape descriptors.
- How does the approach perform when the DNN is adapted using the proposed loss function on the entire target domain data?
- There is a large gap between the oracle and the proposed method.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

- Details about the data preprocessing and the data splits are provided in the paper.
- Details on training are provided in the “Training and implementation details” section.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The paper uses shape descriptor based losses to perform TTA of a segmentation model that is trained only on the source domain. This is a good way to add shape priors to the model, while spending significant training time only on the source domain dataset. The benchmarks are good and show the improvements due to the method. One thing that is missing here is - How does the approach do when the DNN is adapted using the proposed loss function on the entire target domain data? This would show how much better the proposed method is compared to [2]. Why was this not done?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper addresses a relevant problem of TTA using shape descriptors for adaptation. The amount of data available in the target domain could be low, and such methods are useful in those scenarios. The authors add minor changes to what is already proposed in ref [2] and [20]. Although the results perform better compared to other methods on the DSC metric, it is still significantly below the Oracle. It also does not do that well on the ASD metric on most of the cardiac classes and the prostate.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The authors present a single subject test-time domain adaptation approach for image segmentation. The method is based on the entropy minimization of the target image’s softmax prediction, introduced by Wang et al. and further extended by the authors to consider three different shape moments (class-ratios, class-centroids and class-centroid-distances) into the adaptation process. Related work implementations and results with ablations are presented on an MRI-CT multi organ cardiac and a cross-site prostate segmentation dataset. Overall, the authors could show significant performance increases by incorporating the class-ratio and class-centroid moments regarding the Dice score on both datasets and regarding the average surface distance on the MRI-CT task.
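As a rough illustration of the three shape moments named above, the sketch below computes a class ratio, a centroid, and a distance-to-centroid for one class from a 2D map of soft probabilities. It assumes the standard image-moment definitions (zeroth-, first- and second-order central moments); the exact normalization in the paper's Table 1 may differ, and the function name `shape_moments` is hypothetical.

```python
import math

def shape_moments(prob):
    """prob: 2D list (rows x cols) of soft probabilities for a single class."""
    h, w = len(prob), len(prob[0])
    m00 = sum(sum(row) for row in prob)  # zeroth-order moment (soft area)
    ratio = m00 / (h * w)                # class ratio: fraction of the image covered
    if m00 == 0:
        return ratio, None, None         # centroid is undefined for an empty class
    # First-order moments normalized by the soft area give the centroid (row, col).
    cy = sum(y * p for y, row in enumerate(prob) for p in row) / m00
    cx = sum(x * p for row in prob for x, p in enumerate(row)) / m00
    # Square root of the normalized second-order central moments: spread around the centroid.
    dy = math.sqrt(sum((y - cy) ** 2 * p for y, row in enumerate(prob) for p in row) / m00)
    dx = math.sqrt(sum((x - cx) ** 2 * p for row in prob for x, p in enumerate(row)) / m00)
    return ratio, (cy, cx), (dy, dx)

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
print(*shape_moments(mask))  # 0.25 (1.5, 1.5) (0.5, 0.5)
```

Because every quantity is a weighted sum of the probabilities, the same expressions remain differentiable when applied to softmax outputs, which is what allows them to act as soft penalties in a loss.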
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The major strength of the presented test-time domain adaptation method is the increase in performance over classical and source free domain adaptation methods, which is shown to be achieved by the addition of the introduced shape moments.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There are two major concerns regarding the work. First, the presentation of the mathematical aspects of the approach is inconsistent and accordingly, in parts, very hard to follow. There are no major issues; however, multiple inconsistencies add up and make it hard to build an intuition for the general idea of the work (see the section with detailed constructive comments). Second, it seems that the chosen datasets are well suited to the incorporation of the defined shape priors, especially the centroid. It might be a good idea to optimize the center of mass of a class within a slice of the image towards the estimated center of mass defined by the whole 3D image for roundish classes (as they appear in the two datasets); however, this is not necessarily true if there are, e.g., classes with elongated shapes diagonally covering the whole image domain. Intuitively, matching the distance to the centroid from one slice to the whole stack of slices is probably more robust in that regard. The authors could comment in their discussion on the choice of moments, and on why and where they are expected to work well.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The datasets and the code are publicly available; with the correction of a few definitions in the description of the method, the work should be reproducible. However, as the method runs an optimization process at test time, it would be good to include details of the hardware used and the measured runtime, so that the expected inference time for a sample of the target domain is known.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

We found the following mathematical inconsistencies:

The index variable n is used first at page 3 in the Method section to identify an image I_n, whereas it is only defined later (page 4) as the index variable identifying a slice of the target image, therefore in equation 2 it is unclear what the domain omega_n should be (which by itself is also never defined).

Equation page 3 bottom: u and v prime are not defined, are they the components of the centroid as defined at the top of page 4?

Table 1, Class-Ratio: what is the domain omega_T in this equation, should it be the lower-case t (target domain) defined a little later?

The weights nu_k in the formulation of the weighted Shannon entropy should probably be after the summation symbol, not before?

Generally, the notation of scalars, vectors and matrices is not consistent. Bold fonts are used at first but not later, e.g., for the mu after it is stated to be vectorized.
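For reference, the entropy remark above concerns where the class weights sit relative to the sum. Assuming s_n^k(i) denotes the softmax probability of class k at pixel i of slice n, and nu_k the weight of class k (notation inferred from the review comments, not copied from the paper), a well-formed weighted Shannon entropy would read:

```latex
\ell_{\mathrm{ent}}\bigl(s_n(i)\bigr) \;=\; -\sum_{k} \nu_k \, s_n^{k}(i) \, \log s_n^{k}(i)
```

With nu_k written before the summation symbol, the index k would be unbound, which is why the placement matters.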

Result presentation:

Tables 2 and 3: The Dice score is unitless. The ASD should be converted to mm, which allows much better comparability across datasets. It is unclear why TTAS_RC and TTAS_RD are listed as proposed methods while TTAS_R appears under the ablation study heading. An ablation with all three moments would be interesting as well.

Minor linguistic and structural inconsistencies are:
- [..] variations in image modalities … without in
- [..] Standard DA methods, such as [18,17,5,22,18] … duplicated reference and order
- [..] is unavailable during training … available during training
- Page 4, subsection Test-Time adaptation and inference … it is described how the centroid and distance-to-centroid moments are estimated, but at this point it is unclear how the class ratio is estimated.
- Page 6, estimating the shape descriptors … it would probably be much clearer to briefly describe how R prime is created instead of referring to [1].

Other: I wonder how the results would change if a Dice score-based loss is added to the CE loss in the pre-training on the source domain and the oracle as it is standard in state-of-the-art image segmentation to use a combination of both.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The major increase of performance using the introduced shape moments, even over domain adaptation methods which are specifically trained on datasets of the target domain outweighs the weaknesses of the work. The mathematical inaccuracies can be addressed and comments regarding the performance on specific datasets and the selection of the given shape moments can be included.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This paper proposed a simple formulation for source-free and single-subject test-time adaptation of segmentation networks, and demonstrated its performance in MRI-to-CT adaptation and cross-site adaptation.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This work proposed a shape-guided entropy minimization objective with shape moments to achieve the task of test-time single-subject adaptation. From the experiment section, the proposed method exhibits better performance compared with other methods.
2. The chosen datasets in section 3.1 are appropriate to exhibit the performance of the proposed method, by dealing with MRI-to-CT adaptation and cross-site adaptation.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. It is hard to understand the main idea of this work on a first reading.
2. Compositions of Figures 1 and 2 need to be improved.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide source codes for this work and the main datasets used are from challenges. Therefore, the reproducibility of this paper can be achieved to a certain extent. But, it is better to provide more information, like the description of system dependencies, instructions to train and test models, and data preprocessing.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

It is hard to understand the main idea of this work on a first reading. The authors should use more detailed and organized descriptions to state what shape moments are and how to use them. Also, I am not sure whether the citation of [14] (section 2) is correct. I did not find an explanation of “Shape moments” in [14].
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This work proposed a shape-guided entropy minimization objective to improve the segmentation performance in target domains. The experiments show the proposed work exhibits better performance.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The authors propose a method to adapt a pretrained segmentation network to the test data during inference time. All the reviewers agree that this is a very interesting work and ideas are novel.

In addition to fixing the minor detailed problems pointed out by the reviewers (like problems with the equations), the authors are encouraged to think over the following points and potentially extend their work.
1. “How does the approach do when the DNN is adapted using the proposed loss function on the entire target domain data?” Given that the work is for test time adaptation rather than single case test time augmentation, this is a very valid question.
2. “the authors could probably comment on the choice of Moments in their discussion, why and where they are expected to work well?” This seems to be a limitation of the presented work, which warrants further discussion.
3. The logical flow of the presentation can be improved to help readers understand the main idea and intuition behind the work.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

2

Author Feedback

We thank the Reviewers/Meta-Reviewer for their constructive comments. We are happy that all the reviews pointed to the novelty of our test-time formulation and its great interest to MICCAI, and to the strong comparisons with several recent state-of-the-art domain adaptation methods. In the following, we address a misunderstanding by R1 (comparability to [2]), discuss the few weaknesses raised in the reviews, and clarify a few additional points.

– Novelty (R1): We would like to highlight the novelties of our work: 1/ We introduced a method for test-time adaptation for image segmentation, which needs neither access to the source data nor the availability of target training data. 2/ Our method leverages various shape moments, which we integrated in the loss function. Our method uses a single segmentation network and is, therefore, simpler to optimize than most TTA and DA methods. For example, the recent TTA method in [8] uses an additional denoising autoencoder. 3/ We showed the efficiency of our method for the adaptation of a segmentation model to a new domain with a single subject.

– Comparability to [2] (R1): We would like to clarify that our method and [2] (as well as all other methods presented) have been tested on the exact same subjects, i.e. the Test set of the Target Dataset (referred to as Te in our paper), in both applications. The only difference with [2] is that in our method, a per-subject adaptation of the model is performed. In contrast, methods such as [2] follow the classical domain adaptation (DA) setting: a model is first adapted on target training data (Tr), with hyperparameters chosen using validation data (Tv), and results are presented on an independent test set (Te). In Table 2, Table 3 and Figure 2, quantitative and qualitative results are shown on the test set (Te). It is precisely to be able to compare TTA and DA methods that TTA methods were evaluated on Te. In an extended version of this work, we will further evaluate results on the whole target domain data (Tr+Tv+Te) as suggested by R1.

– Large gap with the Oracle (R1): We would like to highlight that this gap (~9% DSC in both applications) is to be expected. Indeed, the setting of the problem is extremely challenging, as we tackle one-subject adaptation to domain shift. Furthermore, we would like to point out that the gap with the Oracle is larger or equivalent for Domain Adaptation and Source-Free Domain Adaptation methods, all of which have access to more data.

– Suitability of our method to roundish versus elongated shapes (R2): We agree with R2 and will comment in the discussion on the expected performance depending on the shape of the structures of interest. However, we would like to point out to R2 that our method was shown to improve the model’s performance when segmenting both roundish shapes (prostate, myocardium) and a tubular shape (aorta), with both the centroid and the distance-to-centroid moments.

– Comments regarding the selection of the shape moments (R2): We agree with R2 and have included comments on the choice of shape moments.

– The Shannon entropy expression (page 4) (R1, R2, R3): As the reviewers noted, the class weights should be inside the summation over ‘k’. We will correct this and thank the reviewers for their careful attention.

– Notation inconsistencies (R2): We have addressed the notation inconsistency regarding the target domain Omega_T.

– Discussion of shape moments in the citation of [14] (R3): We would like to point R3 to Table 1, page 9 of [14], and to Section 4.8, which references works using shape moments. It is therefore very relevant to refer to [14] when discussing prior computer vision work introducing shape moments.



Test-Time Adaptation with Shape Moments for Image Segmentation