
Authors

Xiaohan Xing, Zhen Chen, Zhifan Gao, Yixuan Yuan

Abstract

Noisy annotations are inevitable in clinical practice due to the labeling effort and expert domain knowledge required. Therefore, medical image classification with noisy labels is an important topic. A recently advanced paradigm in learning with noisy labels (LNL) first selects clean data with the small-loss criterion, then formulates the LNL problem as a semi-supervised learning (SSL) task and employs Mixup to augment the dataset. However, the small-loss criterion is vulnerable to noisy labels, and the Mixup operation is prone to accumulating errors in pseudo labels. To tackle these issues, this paper presents a two-stage framework with novel criteria for clean data selection and a more advanced Mixup method for SSL. In the clean data selection stage, based on the observation that the gradient space reflects optimization dynamics and the feature space is more robust to noisy labels, we propose two novel criteria, i.e., Gradient Conformity-based Selection (GCS) and Feature Conformity-based Selection (FCS), to select clean samples. Specifically, the GCS and FCS criteria identify clean data that better align with the class-wise optimization dynamics in the gradient space and the principal eigenvector in the feature space. In the SSL stage, to effectively augment the dataset while mitigating the disturbance of unreliable pseudo-labels, we propose a Sample Reliability-based Mixup (SRMix) method that selects mixup partners based on their spatial reliability, temporal stability, and prediction confidence. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods on two medical datasets with synthetic and real-world label noise.
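
To make the feature-conformity idea concrete, below is a minimal sketch of FCS-style scoring as the abstract describes it: each class's features are compared against that class's principal eigenvector, and well-aligned samples are treated as clean. This is not the authors' implementation; `fcs_scores`, its arguments, and the SVD-based eigenvector computation are assumptions for illustration only.

```python
# Minimal sketch of feature-conformity scoring; hypothetical names throughout,
# and the paper's actual criterion may differ in detail.
import numpy as np

def fcs_scores(feats: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Score each sample by the alignment of its feature with the principal
    eigenvector of its (possibly noisy) class."""
    scores = np.zeros(len(feats))
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        x = feats[idx]
        x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)  # L2-normalize
        # First right singular vector = principal eigenvector of the
        # class feature covariance.
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        scores[idx] = np.abs(x @ vt[0])  # |cosine| with the principal direction
    return scores

# Per class, the highest-scoring samples would be kept as "clean".
```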

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_8

SharedIt: https://rdcu.be/dnwJq

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper proposes a new paradigm in learning with noisy labels (LNL) based on Gradient Conformity-based Selection (GCS) and Feature Conformity-based Selection (FCS) for clean sample selection, and Sample Reliability-based Mixup (SRMix) for selecting mixup partners.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem of learning with noisy labels is an interesting problem to solve, especially in the medical domain. The paper proposes significant ideas to improve this training paradigm and conducts extensive experimentation on a variety of datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper lacks a few critical items:

    • A similar framework has been proposed and tested in a CVPR 2022 paper: UNICON: Combating Label Noise through Uniform Selection and Contrastive Learning. Many elements look the same; a discussion of the distinctions between the two works, together with experimental results comparing the two frameworks, seems appropriate.
    • The paper makes strong claims without justification, merely citing some papers (references 7, 8, and 15).
    • Many hyperparameters used in the paper are not discussed in the experiments, e.g., the gamma factor and K in the KNN.
    • I am also quite confused and concerned that the sample reliability factor is the product of three loss terms. Why multiplication and not summation? Doesn’t it make the training unstable?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Seems appropriate

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper proposes to solve an important research question and several great ideas are introduced in the paper. The weaknesses mentioned earlier would require some revisions.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Given the feedback above, I believe the paper benefits from some revisions and additional experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper addresses the problem of label noise in medical image classification tasks and proposes a two-stage framework with novel criteria for clean data selection and a more advanced Mixup method for semi-supervised learning. The proposed Gradient Conformity-based Selection (GCS) and Feature Conformity-based Selection (FCS) criteria identify clean data that aligns with the optimization dynamics in the gradient space and the principal eigenvector in the feature space. To avoid errors caused by unreliable pseudo labels, the Sample Reliability-based Mixup (SRMix) method is proposed. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods on two medical datasets with synthetic and real-world label noise.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In medical image classification, learning from noisy labels is an important problem. The proposed clean data selection via gradient and feature conformity-based selection is reasonable and novel. In the semi-supervised learning stage, a Sample Reliability-based Mixup (SRMix) is proposed that can mitigate the effect of unreliable pseudo-labels and hence improve the overall performance when combined with the clean data selection strategies (evident from the ablation study). The paper is well-written, and the comparison experiments are comprehensive.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Overall, the manuscript is well arranged. The following are a few weaknesses:

    • Some technical details are missing. For example, it is mentioned in the manuscript that the reliability threshold τR is set as 0.2 for the WCE dataset and 0.05 for the histopathology dataset. However, it is not mentioned how these thresholds are selected. In this case, the performance of the proposed method cannot be generalized to other medical imaging datasets.
    • Mathematical equations in Section 2.1 are mixed with the text, making it difficult for readers to follow.
    • A few typos, e.g., in Section 3.1, Page 7, line 3, β1 is mentioned twice, though it should be β1 and β2.
    • The dotted arrows in Fig. 1 could be made clearer by increasing their size.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, code will be released

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see para 6

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript is well-written, and the ideas and methodology are well-structured and presented. Experiments are solid, and the proposed framework outperforms state-of-the-art methods. The contribution of each module is well explained via ablation study.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I would like to maintain my initial rating as I had no major negative observations. R4 agrees with my assessment. The rebuttal mainly addresses R2’s comments.



Review #4

  • Please describe the contribution of the paper

    The paper presents a novel method for combating label noise in learning tasks by utilizing gradient conformity-based selection (GCS) and feature conformity-based selection (FCS) for clean data identification. It further proposes a Sample Reliability-based Mixup (SRMix) method to mitigate error accumulation of pseudo labels in model training. The experiments show that the proposed method outperforms existing state-of-the-art methods in learning with noisy labels under diverse synthetic and real-world noise settings, demonstrating its effectiveness in addressing label noise challenges.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel formulation: The paper introduces a novel approach to tackle label noise by combining gradient conformity-based selection (GCS) and feature conformity-based selection (FCS) for clean data identification. This combination allows for a more accurate and robust selection of clean samples, addressing the limitations of existing methods.

    2. Innovative data usage: The proposed Sample Reliability-based Mixup (SRMix) method mitigates error accumulation in pseudo labels during model training. By leveraging sample reliability information, the method enhances learning performance while reducing the impact of noisy labels.

    3. Strong evaluation: The paper provides a thorough evaluation of the proposed method on multiple benchmark datasets under different noise scenarios. The experiments demonstrate the effectiveness of the approach in handling label noise, as it outperforms state-of-the-art methods in terms of accuracy and robustness.

    4. Practical applicability: The method is not limited to specific domains, making it widely applicable across various learning tasks and scenarios where label noise is a common challenge. This versatility increases the potential impact of the proposed approach.

    5. Clear presentation: The paper is well-structured and well-written, making it easy to understand the proposed method, its rationale, and its advantages. This clarity enhances the paper’s contribution to the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Theoretical justification: The paper could benefit from a more in-depth theoretical analysis of the GCS and FCS components and their interaction, as well as the SRMix method. This would help readers understand the underlying principles and assumptions of the approach, and potentially reveal possible limitations or future improvements.

    2. Computational complexity: The authors do not discuss the computational complexity of the proposed method, which might be a concern for large-scale datasets or real-time applications. Analyzing the computational cost and discussing potential optimizations could strengthen the paper.

    3. Insufficient explanation of hyperparameters: The paper does not provide a clear explanation or justification for the choice of hyperparameters used in the experiments. A more detailed discussion on the selection and tuning of hyperparameters would strengthen the reproducibility and robustness of the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides sufficient detail in its methodology and experimental setup to ensure reproducibility of the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Consider providing a more in-depth theoretical analysis of the GCS and FCS components and their interaction, as well as the SRMix method, to help readers understand the underlying principles and assumptions of the approach.

    2. Analyze the computational complexity of the proposed method and discuss potential optimizations or trade-offs to improve the practical applicability of the approach.

    3. Provide a clear explanation and justification for the choice of hyperparameters used in the experiments to improve the reproducibility and robustness of the proposed method.

    4. Overall, the paper presents a promising approach to tackling label noise in learning tasks, and the thorough evaluation demonstrates the effectiveness of the proposed method.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach to tackling label noise in learning tasks with a strong evaluation on multiple benchmark datasets. While the lack of theoretical analysis, computational complexity, and insufficient explanation of hyperparameters are weaknesses, the paper’s strengths, including the innovative data usage and practical applicability, make it a good paper with moderate weakness, and a valuable contribution to the field of learning with noisy labels.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The review comments are mixed. The authors need to answer the reviewers’ questions in the rebuttal.




Author Feedback

We sincerely thank the AC and all Reviewers for their constructive comments, and for agreeing that our method is well-motivated and novel. Below please find our responses to the main comments.

R2Q1: Difference between the proposed method and UNICON.

  • Our method is fundamentally different from UNICON. For clean data selection, we propose novel criteria based on feature and gradient conformity, while UNICON relies on the commonly-used small-loss criterion applied in a class-balanced manner. In the SSL stage, we propose a novel SRMix method that selects mixup partners based on their reliability, while UNICON introduces contrastive learning to combat noisy labels. Although both methods consist of a clean data selection stage and an SSL stage derived from DivideMix, our method is substantially novel. On the WCE dataset, UNICON achieves accuracies of 93.76%, 89.45%, 83.69%, and 89.80% under 20%, 40%, and 50% sym. noise and 40% pairflip noise, respectively, which is inferior to our method as shown in Table 1.

R2Q2: Justification of claims.

  • The claims in our paper are consistent with common sense in the LNL community, and experimental results in our paper can well justify these claims. First, in the ablation results of Table 3, compared with the small-loss criterion in the output space (1st line), our proposed FCS criterion in the feature space led to performance gains (3rd line). This result verifies our first claim: the feature space is more robust to corrupted labels than the output space [7]. Second, the t-SNE visualization result in Fig. S3 (b) of the supplementary file shows that samples with the same true labels have similar gradients. This result supports the second claim: optimization dynamics can reflect the true class information (training samples from the same class usually exhibit similar optimization dynamics) [8, 15].

R2Q3 & R3Q1 & R4Q3: Explanation and discussion of hyperparameter selection.

  • The hyperparameters were tuned before the 5-fold cross-validation. Specifically, we split the dataset into 70% train + 10% validation + 20% test and tuned hyperparameters on the 10% validation set. We discuss the influence of crucial hyperparameters on the WCE dataset with 40% pairflip noise. When gamma (the proportion of the anchor set) = 5, 10, 20, 50, and 100, the classification accuracy is 89.79%, 92.33%, 92.38%, 91.34%, and 79.97%, respectively. When only a small portion of anchor samples is selected (gamma=5), the principal gradients and feature eigenvectors are not very representative, leading to worse performance (89.79%). When gamma is very large (gamma=50 or 100), some noisy samples might be included in the anchor set and degrade model performance. For the KNN in the gradient computation, K = 1, 5, 10, and 20 leads to accuracies of 91.12%, 91.23%, 92.38%, and 92.32%, respectively. This result suggests that a reasonably large number of neighbors leads to more accurate and stable computation. Before submitting the final version, we will add more detailed discussions of the hyperparameters to the supplementary file.
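
As a rough illustration of the two hyperparameters discussed in this response (not the authors' code; `scores`, `grads`, and both function names are hypothetical), gamma can be read as keeping the top gamma% highest-scoring samples as anchors, and K as the neighborhood size used to average per-sample gradients:

```python
# Hypothetical sketch of the gamma and K hyperparameters; only the roles
# described in the rebuttal (anchor-set proportion, KNN neighborhood size)
# are taken from the text.
import numpy as np

def anchor_indices(scores: np.ndarray, gamma_percent: float) -> np.ndarray:
    """Keep the top gamma% of samples (by some cleanliness score) as anchors.
    Too few anchors give unrepresentative principal directions; too many let
    noisy samples leak in, matching the accuracy trend reported above."""
    n = max(1, int(len(scores) * gamma_percent / 100.0))
    return np.argsort(scores)[-n:]

def knn_smoothed_grads(grads: np.ndarray, k: int = 10) -> np.ndarray:
    """Average each per-sample gradient over its k nearest neighbors; a
    reasonably large k stabilizes noisy individual gradients."""
    sq = np.sum(grads ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * grads @ grads.T  # pairwise sq. dists
    nn = np.argsort(d2, axis=1)[:, :k]                      # k nearest (incl. self)
    return grads[nn].mean(axis=1)                           # (N, k, D) -> (N, D)
```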

R2Q4: Why use the product rather than summation to compute the sample reliability factor? Does it make the training unstable?

  • On the 4 noise settings of the WCE dataset, the summation-based method leads to an accuracy of 94.26%, 91.61%, 83.79%, and 92.49%, respectively, which are inferior to our method based on the product (94.65%, 92.44%, 86.59%, and 92.38% shown in Table 1). In our method, the three reliability terms are all normalized to the range of [0, 1], thus their product is also within the range of [0, 1], and does not make the training unstable.
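
A minimal sketch of the argument in this response, assuming three reliability terms already normalized to [0, 1] (the names `spatial`, `temporal`, `confidence`, and `tau_r` are hypothetical; only the normalize-then-multiply structure follows the rebuttal):

```python
# Sketch of a product-based reliability factor built from three normalized
# terms; hypothetical names, not the authors' implementation.
import numpy as np

def reliability(spatial: np.ndarray, temporal: np.ndarray,
                confidence: np.ndarray) -> np.ndarray:
    """Each term lies in [0, 1], so the product also lies in [0, 1] and
    cannot destabilize training. Unlike a sum, a single near-zero term
    vetoes the sample (a soft logical AND)."""
    for t in (spatial, temporal, confidence):
        assert (t >= 0).all() and (t <= 1).all(), "terms must be pre-normalized"
    return spatial * temporal * confidence

# Samples with reliability >= tau_r (0.2 for WCE, 0.05 for histopathology,
# per Review #3) would then be trusted as mixup partners.
```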

R4Q2: Computational complexity & Real-time application.

  • Our method yields significant performance gains at a slightly higher computational cost (training time: 81.43 s/epoch) than the baseline DivideMix (training time: 78.61 s/epoch). The inference time for each sample is 3.93 ms, which meets the requirements of real-time applications.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers still have concerns after the rebuttal and decide to reject this paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The primary concern of R2 was the method’s novelty w.r.t. UNICON. The rebuttal does a good job of addressing this and the other concerns raised by reviewers R3 and R4. I feel this paper deserves to be accepted.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper received mixed initial reviews. The critical question from R1 is about the citation of a CVPR 2022 paper. In the rebuttal, the authors explained the differences and reported the performance comparison. Among the three reviewers, R2 posted post-rebuttal comments while the others did not. The authors are encouraged to cite the CVPR 2022 paper and report the comparison results in their camera-ready version. Acceptance is recommended.


