Authors

Wenao Ma, Cheng Chen, Shuang Zheng, Jing Qin, Huimao Zhang, Qi Dou

Abstract

Class distribution plays an important role in learning deep classifiers. When the proportion of each class in the test set differs from the training set, the performance of classification nets usually degrades. Such a label distribution shift problem is common in medical diagnosis since the prevalence of disease varies over location and time. In this paper, we propose the first method to tackle label shift for medical image classification, which effectively adapts the model learned from a single training label distribution to an arbitrary unknown test label distribution. Our approach innovates distribution calibration to learn multiple representative classifiers, which are capable of handling different one-dominating-class distributions. When given a test image, the diverse classifiers are dynamically aggregated via consistency-driven test-time adaptation to deal with the unknown test label distribution. We validate our method on two important medical image classification tasks including liver fibrosis staging and COVID-19 severity prediction. Our experiments clearly show the decreased model performance under label shift. With our method, model performance significantly improves on all the test datasets with different label shifts for both medical image diagnosis tasks. Code is available at https://github.com/med-air/TTADC.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_30

SharedIt: https://rdcu.be/cVRtg

Link to the code repository

https://github.com/med-air/TTADC

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new method, named Test-time Adaptation with Calibration (TTADC), to handle arbitrary unknown test label distribution shifts, addressing the practical issue of the test class proportions differing from those of the training set. Specifically, TTADC first trains K representative one-dominating-class classifiers during training, followed by adaptive test-time aggregation of those K classifiers exploiting augmentation consistency. Experiments on real-world medical diagnosis tasks demonstrate the effectiveness of the proposed TTADC.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A novel method for test-time adaptation, i.e., TTADC, is proposed, with strong and convincing empirical evaluations. Experiments on real-world clinical data demonstrate clinical feasibility. The paper is well written with clear logic.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The clarity associated with the derivations, especially Eqs. 1-3, should be significantly improved. The underlying assumption on multiple binary classifications with the sigmoid function is not discussed in detail.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is not available yet. But the TTADC is expected to be readily reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Please highlight the assumption that all test data samples are accessible simultaneously.
    2. It’s assumed that, after converting multi-class classification into multiple binary classifications, the resulting binary classifications are conditionally independent. This is a strong assumption that needs careful justification. Please elaborate on that.
    3. The proof of Eq.(1) is not convincing, because the connections/assumptions between y_i and y_{-i} (given x) are completely missing. Will Eq (1) still hold? Under what assumptions?
    4. The clarity on Eqs 1-3 should be significantly improved. For example, the main idea is not clearly stated; the terms related to “expected” or “training” can be confusing. Eqs (1) and (3) are not easy to understand without the supplementary materials.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented techniques are novel and likely useful to many practical applications.

  • Number of papers in your stack

    2

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents the first method to tackle label shift for medical image classification, which effectively adapts the model learned from a single training label distribution to an arbitrary unknown test label distribution. Note that the label distribution means the proportion of each class. The contributions can be summarized in the following aspects:

    C1: This paper innovates distribution calibration to learn multiple representative classifiers, which are capable of handling different one-dominating-class distributions.

    C2: When given a test image, the diverse classifiers are dynamically aggregated via the consistency-driven test-time adaptation, to deal with the unknown test label distribution.

    C3: The authors validate their method on two medical image classification tasks, including liver fibrosis staging and COVID-19 severity prediction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    S1: This paper presents the first work to effectively tackle the label distribution shift in medical image classification.

    S2: The paper is overall clearly structured.

    S3: Authors have tested the proposed method on two public datasets to demonstrate the effectiveness of the proposed method and have conducted extensive ablation studies.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    W1: Although the authors have elaborated on the significance of the tackled problem, label distribution shift, I cannot grasp the significance of solving the label distribution shift issue if there is no long-tail problem in the training dataset or no domain gap between the training and test datasets. Since we usually conduct inference for test images one by one, the final classification performance is not sensitive to the label distribution of the test data.

    W2: As illustrated in the experimental section, the training and test data are from different centers with a domain gap. What is the difference between the tackled label distribution shift problem and widely investigated adaptation tasks, such as domain adaptation and test-time adaptation?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors claim that code will be available after review.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is suggested that the authors should provide the comparison between the tackled label distribution shift problem and long-tail problem. Moreover, the comparison between the tackled problem and test-time adaptation task should also be discussed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is based on my concerns in 4 and 5. I am willing to upgrade my score if authors can address my concerns in 5, that is, clarify the significance of addressing the label distribution shift issue.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a framework that applies test-time adaptation to solve label distribution shift in the context of CNNs for medical images. The method is composed of two phases: 1) training different models that are able to handle different label distributions; 2) aggregating these classifiers at test time in order to solve the label shift problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Well-conducted research. The paper is well written and the research is well conducted. Every choice is well documented and the results are convincing and presented in a proper way (mean and std).
    • Different tasks are evaluated (liver fibrosis staging and COVID-19 severity prediction).
    • Ablations show that the proposed components actually improve results.
    • Results. They compared with different SOTA models (even if these SOTA methods were not built with a medical setting in mind) and achieved the best results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The difference with [34]. Please add information on how much and in what ways this method differs from the cited work, as the provided explanation (one line with no clear information) is not enough to compare them.
    • Sec. 2.3, “…with the implicit knowledge of label distribution on the test set”: can you give more insight about this information? I see from the test-time adaptation loss that label information is not present (and I suppose that this information cannot be used), but this sentence leaves me with doubts.
    • The advantages of test-time adaptation in the medical setting are not clear to me. It would be beneficial if the authors could clarify this aspect in the paper or in the rebuttal. Why do I need to adapt at test time? How does your model work when trained on balanced training and test sets and compared to a classical training procedure? I think that one of the major barriers to applying AI in medicine is that models are still not effectively explainable. I don’t know how a model that adapts itself at inference time could be helpful to the medical community.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Everything is OK.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I find the paper valuable and well written, but I can’t see the advantage of test-time adaptation in a medical setting. Let me explain with a simple example, COVID detection or severity as in the paper: I would like to have a model that, given a single CT scan image, can detect whether COVID is present, its severity, and the areas that lead to that decision, without being influenced by how many other positive cases there are. I suppose that the goal is to train models that can extract features related to the pathology. Why will my model be influenced at test time if I trained it in a balanced manner? Maybe you can explain better, in the paper or in response to this review, what the advantage is.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My rating is primarily based on doubts about the value of this work for the medical setting. The work is valuable from all other points of view, so I decided to leave a weak accept in order to wait for a clarification.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose a framework to deal with label distribution shift (i.e. the proportion of each class) in medical images with CNNs. This is achieved by training multiple classifiers, which are capable of handling different one-dominating-class distributions and are dynamically aggregated via consistency-driven test-time adaptation during testing. The method is evaluated on two medical image classification tasks.

    This paper is one of the first to tackle label distribution shift in medical image classification, is well-written and well structured. The results are presented properly (including mean+stddev) and are evaluated on two datasets. An ablation study is included and the method is benchmarked against other SOTA methods that haven’t been applied to the medical domain yet.

    The reviewers point out some weaknesses, mainly that the clinical use of a test-time adapting method is limited in terms of explainability. For future work, a report of the model performance on a balanced training and test set compared to a standard model would be interesting to see.

    As the strengths clearly outweigh the weaknesses, this paper is to be accepted.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank the AC and reviewers for their valuable time and comments. We would like to clarify the meaning and significance of tackling label distribution shift in medical applications and the difference between the label distribution shift and data distribution shift/long-tail problem.

We focus on label distribution shift in medical datasets because this problem is very common in medical diagnosis with machine learning models, and such shift often degrades the performance of a learned classifier on test data, leading to erroneous predictions, as observed in prior work [1-3]. This is because the optimal classifier can change under label shift, which can be explained with Bayesian inference as introduced in Section 2.1 of our paper. A typical example is flu prediction: if a model is trained on data collected under a regular morbidity rate, the model would perform worse, with an increasing false negative rate, when tested during a flu outbreak [4]. The label distribution shift problem has been investigated on datasets of natural images [4-6] but has not yet been tackled on medical datasets. In Section 3.2 of our paper, we clearly show the label distribution shift problem in the multi-center liver CT datasets by comparing the ROI segmentation and liver fibrosis classification performance between the evaluation set of the training dataset and two test datasets with label shift. With our test-time adaptation from distribution-calibrated classifiers, the model performance on the test datasets is significantly improved, as shown in Table 1.

Label distribution shift and data distribution shift are two types of dataset shift, as introduced in previous work [7,8]. Under label shift, the label distribution p(y) changes between the training and test data, while the conditional distribution p(x|y) stays the same. Data distribution shift is the opposite situation, where the conditional distribution p(x|y) is shifted due to different imaging conditions, while the label distribution p(y) stays the same. The experimental results shown in Fig. 2 demonstrate that there is no data distribution shift between the training and test sets of the liver CT data, and that the label shift leads to decreased model performance.
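The Bayesian argument above can be made concrete with the standard prior-ratio correction for label shift. The sketch below is generic and illustrative only (it is not the paper's TTADC method, and the function name and numbers are invented for this example): since p(x|y) is assumed unchanged, a classifier's posterior under the training prior can be re-weighted by p_test(y)/p_train(y) and renormalized to obtain the posterior under the test prior.

```python
import numpy as np

def adjust_posterior(train_posterior, train_prior, test_prior):
    """Re-weight class posteriors for a shifted label distribution.

    Under label shift p(x|y) is unchanged, so Bayes' rule gives
    p_test(y|x) proportional to p_train(y|x) * p_test(y) / p_train(y).
    """
    weights = np.asarray(test_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    adjusted = np.asarray(train_posterior, dtype=float) * weights
    return adjusted / adjusted.sum(axis=-1, keepdims=True)

# A classifier trained under a balanced prior slightly favors class 1:
posterior = np.array([0.45, 0.55])
train_prior = [0.5, 0.5]
test_prior = [0.8, 0.2]  # class 0 is far more prevalent at test time

adjusted = adjust_posterior(posterior, train_prior, test_prior)
# After accounting for the test prior, the predicted label flips to class 0,
# illustrating why the optimal classifier changes under label shift.
```

This is why a model trained under one prevalence can produce systematically wrong decisions at another, even when its features (p(x|y)) transfer perfectly.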

Long-tailed visual recognition can be considered an instance of the label shift problem [8], where a model is assumed to be trained on a long-tailed dataset and tested on a uniform target label distribution. However, the degradation of the model's performance is caused by the label shift between the training and test data rather than by the long-tailed training data distribution itself. Even if the training label distribution is uniform, model performance would decrease when the test label distribution is shifted.

Reference

  1. Challen, R., Denny, J., Pitt, M., Gompels, L., et al.: Artificial intelligence, bias and clinical safety. BMJ Quality & Safety 28(3), 231–237 (2019)
  2. Chen, I.Y., Joshi, S., Ghassemi, M., et al.: Probabilistic machine learning for healthcare. Annual Review of Biomedical Data Science 4, 393–415 (2021)
  3. Davis, S.E., Lasko, T.A., Chen, G., et al.: Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association 24(6), 1052–1061 (2017)
  4. Guo, J., Gong, M., Liu, T., et al.: LTF: A label transformation framework for correcting label shift. In ICML, pp. 3843-3853 (2020)
  5. Lipton, Z., Wang, Y.X., et al.: Detecting and correcting for label shift with black box predictors. In ICML, pp. 3122-3130 (2018)
  6. Wu, R., Guo, C., Su, Y., et al.: Online adaptation to label distribution shift. In NeurIPS, 34 (2021)
  7. Subbaswamy, A. and Saria, S.: From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics, 21(2), pp.345-352 (2020)
  8. Hong, Y., Han, S., et al.: Disentangling label distribution for long-tailed visual recognition. In CVPR, pp. 6626-6636 (2021)
