
Authors

Qiushi Yang, Xinyu Liu, Zhen Chen, Bulat Ibragimov, Yixuan Yuan

Abstract

Semi-supervised learning (SSL) for medical image classification has achieved exceptional success in efficiently exploiting knowledge from unlabeled data when labeled data is limited. Nevertheless, recent SSL methods suffer from misleading hard-form pseudo labeling, which exacerbates the confirmation bias issue due to the rough training process. Moreover, their training schemes depend excessively on the quality of the generated pseudo labels, making them vulnerable to inferior ones. In this paper, we propose TEmporal knowledge-Aware Regularization (TEAR) for semi-supervised medical image classification. Instead of using hard pseudo labels to train models roughly, we design Adaptive Pseudo Labeling (AdaPL), a mild learning strategy that relaxes hard pseudo labels to soft-form ones and provides cautious training. AdaPL is built on a novel theoretically derived loss estimator, which approximates the loss of unlabeled samples according to temporal information across training iterations, to adaptively relax pseudo labels. To relieve the excessive dependency on biased pseudo labels, we take advantage of the temporal knowledge and propose Iterative Prototype Harmonizing (IPH), which encourages the model to learn discriminative representations in an unsupervised manner. The core principle of IPH is to maintain the harmonization of clustered prototypes across different iterations. Both AdaPL and IPH can be easily incorporated into prior pseudo labeling-based models to extract features from unlabeled medical data for accurate classification. Extensive experiments on three semi-supervised medical image datasets demonstrate that our method outperforms state-of-the-art approaches. The code is available at https://github.com/CityU-AIM-Group/TEAR.
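
The paper's exact relaxation formula (the modulation function of Eq. 5 and the loss estimator of Eq. 3) is not reproduced on this page. As a rough illustration of the AdaPL idea described above, softening a hard pseudo label in proportion to an estimated unreliability, a minimal PyTorch sketch could look as follows; the function name and the uniform-blending scheme are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def relax_pseudo_label(probs: torch.Tensor, relax: torch.Tensor) -> torch.Tensor:
    """Blend hard pseudo labels with a uniform prior (illustrative, not Eq. 5).

    probs: (B, C) softmax predictions for unlabeled samples.
    relax: (B,) relaxation intensity in [0, 1]; larger values mean the
           temporal loss estimator judged the pseudo label less reliable.
    """
    num_classes = probs.size(1)
    hard = F.one_hot(probs.argmax(dim=1), num_classes).float()  # hard pseudo labels
    uniform = torch.full_like(hard, 1.0 / num_classes)
    r = relax.unsqueeze(1)
    # relax = 0 keeps the hard one-hot target; relax = 1 degenerates to uniform.
    return (1.0 - r) * hard + r * uniform
```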

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_12

SharedIt: https://rdcu.be/cVRYP

Link to the code repository

https://github.com/CityU-AIM-Group/TEAR

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes a new framework TEAR for semi-supervised medical image classification, which contains an AdaPL module for relaxing hard pseudo labels to soft-form ones and an IPH module for aligning feature prototypes across different training iterations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    AdaPL, designed to mitigate confirmation bias, is based on a theoretically derived loss estimator.

    The proposed IPH, which encourages harmonious clustered prototypes across different training iterations, works in an unsupervised way.

    The paper is well written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed IPH exploits cluster-aware information by conducting an unsupervised clustering method (e.g., K-Means) within a mini-batch, so the batch size is vital to IPH. I suspect the experimental results are sensitive to the batch size, which, however, is not discussed in the paper.

    For the ISIC dataset, the authors seem to copy the comparison results of GLM [36] and NM [22] from NM, which is an unfair comparison because the proposed method adopts strong augmentation for the training data.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The implementation details are sufficient for reproducing the paper's results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    An ablation analysis on batch size is needed to verify whether the results are sensitive to it.

    It would be better to visualize the clusters across different training iterations. This would help illustrate how the proposed IPH module works.

    A fair comparison between the proposed method and other medical image classification methods [22, 36] should be conducted.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments lack a key ablation analysis, and the comparison experiments are somewhat unfair.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This paper presents a semi-supervised image classification method with adaptive pseudo labelling and iterative prototype consistency. The adaptive pseudo labelling uses a loss-estimating function to soften and calibrate hard pseudo labels, while the iterative prototype consistency aligns the clustered class centroids across model training iterations to reduce the dependency on pseudo labels. The authors conduct experiments on three datasets, and the proposed method outperforms existing state-of-the-art algorithms. The authors also provide detailed ablation results to verify the effectiveness of their method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using a loss-estimating function across training iterations to calibrate pseudo labels is interesting and reasonable. The authors provide a theoretical proof of the feasibility of the loss-estimating function.

    • The proposed method achieves SOTA results on three datasets, and the authors provide sufficient ablation results to demonstrate each component and the effect of hyper-parameters.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • IPH computes clustered feature centroids within each mini-batch, which makes it a little unconvincing, because the batch class mean is likely to be a poor approximation of the real mean (see the sketch after this list). The batch size may also affect the clustering result: when not all classes are present in a batch, the clustered features cannot reflect the correct class distribution and may thus introduce new problems.

    • The authors did not show how the metric that measures the consistency of clustered feature centroids changes over training iterations. For instance, a figure showing the distance between centroids across iterations, before and after using IPH, might help illustrate the mechanism of IPH.
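
To make the batch-size concern concrete, below is a minimal sketch of per-mini-batch prototype extraction as the reviews describe it (unsupervised K-Means within a single batch); the function name is an illustrative choice, and the paper's actual IPH implementation may differ:

```python
import torch
from sklearn.cluster import KMeans

def batch_prototypes(feats: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """Cluster one mini-batch of features and return the centroids.

    With a small batch, some classes may be absent or represented by only
    a few samples, so these batch-level prototypes can drift far from the
    true class means, which is the sensitivity both reviews point out.
    """
    km = KMeans(n_clusters=num_clusters, n_init=10)
    km.fit(feats.detach().cpu().numpy())
    return torch.from_numpy(km.cluster_centers_).to(feats.dtype)
```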

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide sufficient implementation details, which helps reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I appreciate the thoroughness of the analysis (different datasets, several state-of-the-art methods, ablation study). The organization and presentation of the paper are easy to follow. However, the paper lacks a qualitative analysis demonstrating the mechanism behind the components (IPH) of the proposed method. It would be excellent if the authors could provide additional analysis on this point in future work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Good technical contribution.
    2. Good performance on three datasets.
    3. The paper is well organized and easy to follow.
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The author provided detailed experiment to show the effect of batch size, and quantitatively assessed the clustering quality of the proposed IPH module. The additional experiment results well addressed my concern about the effectiveness of the IPH module. Thus, I will keep my decision unchanged and recommend to accept this paper.



Review #4

  • Please describe the contribution of the paper

    This paper proposes TEmporal knowledge-Aware Regularization (TEAR) for semi-supervised medical image classification. An upper bound on the loss of unlabeled samples is theoretically derived and used to relax hard pseudo labels into soft ones. In addition, Iterative Prototype Harmonizing (IPH) is proposed to maintain the harmonization of clustered prototypes across different iterations. In the experiments, the proposed method outperformed state-of-the-art methods, and sufficient ablation studies were conducted.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    *A theoretical analysis of an upper bound on the loss of unlabeled samples is conducted and used to soften the pseudo labels. This is interesting.

    *To match the feature prototypes across different training iterations, IPH is proposed, which takes advantage of the knowledge from different training iterations and provides coherent optimization.

    *The evaluation is sufficient. It includes comparisons with a sufficient number of methods, ablation studies, and a hyper-parameter sensitivity analysis. This shows the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    *Some of the explanations are unclear.

    • In Eq. 5, $\sigma$ is the normalization function. However, $u$ is a single sample, so LE(u;t) is a scalar. How can it be normalized? Does this imply using all the unlabeled data? The current description is a bit ambiguous.
    • In the comparison and ablation study, what value of ‘b’ in Eq. 5 was used?

    *The comparison is acceptable, but to show the effectiveness of the proposed method, it would be better to compare with semi-supervised methods that use training-iteration information, such as the mean-teacher algorithm.

    *Once the process of the method is read carefully, the reason why the method is effective is understandable. However, it is not easy to understand from the introduction alone. The reviewer recommends that the authors clarify this, i.e., show simple examples of the cases where the relaxation is large or small, and why.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    please see above comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As described in the strengths, the paper's organization is good, the proposed method is well designed, and the evaluation is basically sufficient for MICCAI. Therefore, my rating tends toward acceptance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The rebuttal addressed almost all of my concerns. In addition, I read all the other reviews. I feel that the negative rating by R1 was not critical and that most concerns were addressed. Therefore, I keep my rating of ‘accept’.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers commented positively on the theoretical derivations in the paper and the clarity of the writing. Reviewers in general also thought that the paper made strong technical contributions and had sufficient experimentation. Evaluation on three different datasets with different modalities is a strength.

    One of the reviewers had concerns about the effect of batch size on the clustering and the lack of ablation studies to understand this effect, which led them to give a reject rating. Another reviewer had the same concern about batch size but considered it a minor issue. The authors should address the criticisms raised by the reviewers, especially those concerning batch size.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We sincerely thank the meta reviewer and all reviewers for their time and efforts. Below please find the responses to specific comments.

MR & R1Q1 & R3Q1: The impact of batch size on IPH. A: To analyze the impact of batch size, we train TEAR with batch sizes of 128, 64, 32, and 16 on ISIC. The AUC scores in the 350, 800, 1200, and 2000 label regimes are as follows:

Label regime | BS 128 | BS 64 | BS 32 | BS 16
350 | 80.63 | 80.11 | 79.85 | 78.72
800 | 85.50 | 85.25 | 84.87 | 84.04
1200 | 88.52 | 88.34 | 87.90 | 87.18
2000 | 89.87 | 89.61 | 89.26 | 88.54

With the batch size reduced from 128 to 32, the proposed TEAR remains stable and still outperforms the state-of-the-art methods trained with a batch size of 128 in Table 1 of the submitted version. Even with a batch size of 16, TEAR is still comparable to the other methods. Therefore, TEAR performs stably over a suitable range of batch sizes, e.g., from 128 to 32 on ISIC, which shows that our IPH is robust to changes in batch size. We will add the batch-size analysis in the extended version.

R1Q2: Unfair comparison with GLM and NM, which were trained without strong augmentation. A: For a fair comparison, we adopt the same strong augmentations as TEAR to train the GLM and NM models on ISIC. The AUC scores in the 350, 800, 1200, and 2000 label regimes indicate that the proposed TEAR still performs better under this fair comparison:

Method | 350 | 800 | 1200 | 2000
GLM | 79.63 | 85.44 | 88.06 | 89.12
NM | 78.29 | 83.81 | 86.15 | 86.98
TEAR | 80.63 | 85.50 | 88.52 | 89.87

R1Q3 & R3Q2: Visualizations of cluster centroids across different iterations, and the qualitative analysis. A: We have visualized the feature clustering results across different training iterations with and without IPH. Since figures are not permitted in the rebuttal, we compute the average inter-cluster and intra-cluster distances to assess the effect of IPH. (1) With IPH, the inter-cluster distance is 1.1e-3, larger than the 9.0e-4 obtained without IPH, implying that IPH enlarges the inter-cluster distance. (2) Moreover, the average intra-cluster distance is 1.4e-4 with IPH, smaller than the 2.0e-4 without it. Clusters with IPH tend to be more compact during the training phase, indicating that IPH constantly reduces the distance among similar samples. As such, the proposed TEAR forms compact clusters in the latent space and improves performance.
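
The rebuttal does not specify how these distances are defined; one standard way to compute average intra-cluster and inter-cluster distances, assuming Euclidean distance, is sketched below:

```python
import numpy as np

def cluster_distances(feats: np.ndarray, labels: np.ndarray):
    """Average intra-cluster distance (sample to its own centroid) and
    average inter-cluster distance (between distinct centroids)."""
    ids = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in ids])
    intra = np.mean([
        np.linalg.norm(feats[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    inter = np.mean([
        np.linalg.norm(centroids[i] - centroids[j])
        for i in range(len(ids)) for j in range(i + 1, len(ids))
    ])
    return float(intra), float(inter)
```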

R4Q1: In Eq. 5, \sigma is the normalization function. However, u is a single sample (LE(u;t) is a scalar). How to normalize it? A: In Eq. 5, for each unlabeled sample u we first compute a scalar LE(u;t), which indicates the reliability of its pseudo label as measured by the entropy of the current prediction and the discrepancy of predictions across two iterations. The scalar LE(u;t) is then passed through a Sigmoid function as the normalization, guaranteeing that the relaxation intensity lies in [0.0, 1.0], which calibrates the pseudo label via the modulation function. Note that the Sigmoid is applied to the scalar LE value of each unlabeled sample u individually, rather than as a normalization over all unlabeled data together.
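
A minimal sketch of this clarification, assuming a plain element-wise Sigmoid (the modulation function and the constant b from Eq. 5 are not reproduced here):

```python
import torch

def relaxation_intensity(le_values: torch.Tensor) -> torch.Tensor:
    """Map each sample's scalar LE(u;t) into [0, 1] independently.

    The Sigmoid is applied element-wise, i.e. per unlabeled sample,
    with no statistics computed over the whole unlabeled set.
    """
    return torch.sigmoid(le_values)  # le_values: (B,), one scalar per sample
```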

R4Q2: Value of b in Eq.5. A: In all experiments, b is 1.5.

R4Q3: Compare with mean-teacher. A: We train a mean-teacher model on KC and obtain 76.44 and 79.37 AUC in the 5% and 10% label splits. The proposed TEAR yields 80.24 and 83.17 AUC, verifying the superiority of TEAR over other methods that use training-iteration information.
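
For context, the mean-teacher baseline maintains a teacher whose weights are an exponential moving average (EMA) of the student's weights across iterations (Tarvainen & Valpola, 2017); a minimal sketch, with the decay value as an assumed typical setting:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               alpha: float = 0.999) -> None:
    """Mean-teacher step: teacher weights track an exponential moving
    average of the student weights across training iterations."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```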

R4Q4: Show simple examples of when the relaxation is large or small and why. A: From Eq. 3, the relaxation is large if the discrepancy between the two predictions or the entropy of the current prediction is large, and vice versa. Thus, pseudo labels are considered less reliable when predictions differ greatly across two iterations, and the relaxation reduces the effect of pseudo-label noise on SSL.
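
A toy numeric illustration of this behavior; the exact Eq. 3 is not reproduced on this page, so the L1 discrepancy and the plain entropy term below are assumptions standing in for it:

```python
import torch

def temporal_loss_estimate(p_prev: torch.Tensor, p_curr: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for Eq. 3: entropy of the current prediction plus an
    L1 discrepancy between predictions from two training iterations."""
    entropy = -(p_curr * p_curr.clamp_min(1e-8).log()).sum(dim=1)
    discrepancy = (p_curr - p_prev).abs().sum(dim=1)
    return entropy + discrepancy

# Confident, stable prediction -> small estimate -> small relaxation.
stable = temporal_loss_estimate(torch.tensor([[0.95, 0.05]]),
                                torch.tensor([[0.94, 0.06]]))
# Uncertain prediction that flips across iterations -> large estimate.
flipped = temporal_loss_estimate(torch.tensor([[0.80, 0.20]]),
                                 torch.tensor([[0.30, 0.70]]))
print(stable, flipped)  # roughly 0.25 vs. 1.61
```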




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has resolved concerns about the batch size as well as other concerns. I vote for acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes a semi-supervised framework for medical image classification, with an AdaPL module for relaxing hard pseudo labels to soft form and an IPH module for aligning feature prototypes across different training iterations. The initial concerns were mainly about insufficient ablation studies. The authors addressed most of the reviewers' concerns in the rebuttal. I recommend acceptance and ask the authors to reflect the rebuttal points in the paper if it is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies the semi-supervised image classification problem using adaptive pseudo labelling and iterative prototype consistency. The idea is interesting and the theoretical analysis appears solid. Good experimental results have been reported in support of the presented method.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


