
Authors

Fengbei Liu, Yuanhong Chen, Yu Tian, Yuyuan Liu, Chong Wang, Vasileios Belagiannis, Gustavo Carneiro

Abstract

Real-world large-scale medical image analysis (MIA) datasets have three challenges: 1) they contain noisy-labelled samples that affect training convergence and generalisation, 2) they usually have an imbalanced distribution of samples per class, and 3) they normally comprise a multi-label problem, where samples can have multiple diagnoses. Current approaches are commonly trained to solve a subset of those problems, but we are unaware of methods that address the three problems simultaneously. In this paper, we propose a new training module called Non-Volatile Unbiased Memory (NVUM), which non-volatilely stores a running average of model logits to support a new regularization loss for the noisy multi-label problem. We further unbias the classification prediction in the NVUM update to address the imbalanced learning problem. We run extensive experiments to evaluate NVUM on new benchmarks proposed in this paper, where training is performed on noisy multi-label imbalanced chest X-ray (CXR) training sets, formed from Chest-Xray14 and CheXpert, and testing is performed on the clean multi-label CXR datasets OpenI and PadChest. Our method outperforms previous state-of-the-art CXR classifiers and previous methods that can deal with noisy labels on all evaluations. Our code is available at https://github.com/FBLADL/NVUM
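The mechanism the abstract describes can be sketched as follows. This is a minimal, hypothetical illustration based only on the summary above: the memory shape, the EMA coefficient `beta`, the squared-difference regulariser, and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class NVUMSketch:
    """Hypothetical sketch of the NVUM idea: a per-sample memory holds a
    running (EMA) average of model logits, unbiased by the log class
    prior, and a regulariser ties current predictions to that memory."""

    def __init__(self, num_samples, num_classes, prior, beta=0.9, lam=1.0):
        self.memory = np.zeros((num_samples, num_classes))  # one row per training sample
        self.log_prior = np.log(prior)                      # class prior pi_c
        self.beta = beta                                    # EMA smoothing (assumed value)
        self.lam = lam                                      # regulariser weight (assumed value)

    def update(self, idx, logits):
        # Running average of logits, unbiased by subtracting log(pi)
        self.memory[idx] = (self.beta * self.memory[idx]
                            + (1.0 - self.beta) * (logits - self.log_prior))

    def loss(self, idx, logits, targets):
        # Multi-label BCE plus a simplified stand-in regulariser that
        # penalises disagreement with the remembered early-learning logits
        p = sigmoid(logits)
        bce = -np.mean(targets * np.log(p + 1e-12)
                       + (1 - targets) * np.log(1 - p + 1e-12))
        reg = np.mean((p - sigmoid(self.memory[idx])) ** 2)
        return bce + self.lam * reg
```

Because the memory is indexed by training-sample id, a sample whose noisy label contradicts the network's early consensus incurs a growing regularisation penalty instead of being memorised.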

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_52

SharedIt: https://rdcu.be/cVRuD

Link to the code repository

https://github.com/FBLADL/NVUM

Link to the dataset(s)

https://github.com/FBLADL/NVUM/tree/main/dataset_preparation


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel training module, Non-Volatile Unbiased Memory (NVUM), which non-volatilely stores a running average of model logits to support a new regularization loss for the noisy multi-label problem. Experiments on multi-label chest X-ray images demonstrate the superior performance of the proposed method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea is relatively novel.
    2. Well written. The overall flow and structure are very clear.
    3. The method is relatively clear.
    4. Experimental results are good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some points are not clear. For example: between Eq. (2) and Eq. (3), what is the definition of \pi_c? In Eq. (3), why is z_k^i - log(\pi) used? It would be better to show its motivation.

    2. Since this paper aims to solve the noisy multi-label problem with class imbalance, it would be better to individually provide experimental results on noisy multi-label, class-imbalance, and combined noisy multi-label with class-imbalance settings.

    3. Some related works are missing, e.g.: [1] Laine, Samuli, and Timo Aila. “Temporal ensembling for semi-supervised learning.” ICLR (2017). [2] Shi, Xiaoshuang, et al. “Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis.” Medical Image Analysis 60 (2020): 101624.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is relatively simple and the implementation details are clear, so I think this paper has good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be better to explain the following points: between Eq. (2) and Eq. (3), what is the definition of \pi_c? In Eq. (3), why is z_k^i - log(\pi) used? Its motivation should be shown.

    2. Since this paper aims to solve the noisy multi-label problem with class imbalance, it would be better to individually provide experimental results on noisy multi-label, class-imbalance, and combined noisy multi-label with class-imbalance settings.

    3. It would be better to discuss the relationship to the following related works, which also generate predictions using a form similar to Eq. (3): [1] Laine, Samuli, and Timo Aila. “Temporal ensembling for semi-supervised learning.” ICLR (2017). [2] Shi, Xiaoshuang, et al. “Graph temporal ensembling based semi-supervised convolutional neural network with noisy labels for histopathology image analysis.” Medical Image Analysis 60 (2020): 101624.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The idea is relatively novel.
    2. Well written. The overall flow and structure are very clear.
    3. The method is relatively clear.
    4. Experimental results are good.
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work proposes a new regularisation loss to address multi-label medical image classification with label noise and class imbalance. The proposed loss aims to penalise differences between current and early-learning model logits. In addition, the proposed method leverages the logit adjustment technique to unbias the classification predictions affected by the class-imbalance issue.
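The logit-adjustment idea the reviewer refers to can be illustrated with a short, generic sketch. The function names and the frequency-based prior estimate below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def estimate_prior(labels, eps=1e-6):
    # Per-class frequency over the (possibly noisy) multi-label training
    # labels; clipped so log() stays finite for classes never observed.
    return np.clip(labels.mean(axis=0), eps, 1.0)

def adjust_logits(logits, prior):
    # Logit adjustment: subtracting log(pi) removes the head-class bias,
    # effectively boosting rare classes relative to frequent ones.
    return logits - np.log(prior)
```

With all-zero logits, the adjusted score of a rare class exceeds that of a frequent one, which is exactly the unbiasing effect described above.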

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Noisy multi-label imbalanced learning is of great significance for both computer-aided diagnosis systems and clinical applications.

    The idea of penalising differences between current and early-learning model logits together with logit adjustment is interesting.

    The paper is well organized and the implementation details are well described.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The explanation of the effect of the proposed regularization loss is insufficient. The four cases listed in Sec. 2.1 are all based on the strong assumption that networks have high confidence in the classification results. The authors should rethink the gradient analysis of the noisy situation. In my opinion, the authors may consider the distribution of the Jacobian-matrix multiplier for clean samples and noisy samples.

    There is a lack of discussion of the class-level AUC results. The experimental part is insufficient to reflect how the method handles noisy and imbalanced labels.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The implementation details are sufficient for the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors should rethink the gradient analysis of the proposed regularization loss.

    The authors should consider the change of the label distribution and of the Jacobian-matrix multiplier during the training phase.

    The authors should consider providing more demonstrations to highlight the role played by the proposed method.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is interesting, although some explanations of the method and the experimental results are insufficient.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper focuses on a real-world robust learning problem, classification on noisy multi-labelled imbalanced datasets, and proposes a novel NVUM method based on a non-volatile memory module paired with a new regularization loss to alleviate the noisy-label effect and introduce class-prior knowledge into the model update to unbias the classification. Experiments show that the method outperforms other SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper has well-organized writing and a clear motivation for each part of the proposed method: a memory module with a regularization term for the noisy-label issue, and a class-prior update for the imbalance issue.
    • The paper pays attention to the combined challenges in robust learning for MIA: noisy and imbalanced multi-label medical image classification, which is a common issue in real-world datasets but rarely explored.
    • The paper provides a detailed analysis of the gradient effect of the novel regularization term in noisy-label classification with the BCE loss.
    • The proposed NVUM method is evaluated on two benchmarks with real-world noisy multi-label chest X-ray datasets, and achieves SOTA results with a large performance gain compared with other SOTA methods.
    • Comprehensive ablation studies are reported, and the effects of different hyper-parameter and prior settings are fully explored.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • One concern about NVUM is the space size of the memory module, which is S*C and increases linearly with the size of the training set. Since the paper focuses on real-world large-scale datasets, when the training set gets extremely large, the size of the memory module might become a severe issue.
    • There is no ablation study for different noise levels in the training set. Most papers focusing on noisy-label problems introduce different levels of label noise into a clean training dataset and evaluate their methods under multiple noise levels. This paper does not contain such an ablation study, so readers have no idea of the noise tolerance of the proposed NVUM method.
    • Another potential problem is that NVUM takes the class prior distribution into account; however, this prior distribution is directly estimated from the noisy training set, so it might be corrupted under severe noise levels. This issue is mentioned in the Future Work part.

    • One minor issue: Figure 1 is never referred to in the paper, and I think it demonstrates the training and updating pipeline of NVUM. Please add a reference to Fig. 1 in the paper.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As the authors indicate in the Abstract, the code will be made available. In addition, the training hyper-parameters and the preprocessing method are provided in the paper. The datasets and model backbone are all openly accessible. Thus, the work should be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Please add a discussion about the potential size increase of the memory module.
    • It would be better to evaluate the proposed method on different noise levels by adding multiple levels of noise to a clean multi-label dataset. This is important for exploring the noise tolerance of the method.
    • I also suggest exploring the class prior distribution estimation under different noise levels, since the prior might heavily degenerate when there is severe noise in the training set.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper provides strong motivation, clear method analysis and comprehensive experiments with SOTA performance and a large performance gain on two large real-world datasets, with only minor weaknesses in the experiments; therefore, I recommend that the paper be accepted.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel approach to address some of the most pesky problems in multi-label classification, namely noisy labels and class imbalance. The novelty lies in the storage of intermediate information in a memory (NVUM) and its use in regularization. All three reviewers are positively inclined towards the paper, but have raised important comments to be addressed in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We will include results using the public noisy-label medical image benchmark from [Zhang et al. “Alleviating Noisy-label Effects in Image Classification via Probability Transition Matrix.” BMVC’22] to test our method under different rates of symmetric noise. We tested our method on their ResNet-18 benchmark, where their baseline accuracy using the 100% clean set is 64.4%. With 20% symmetric noise, our method reaches 61.3% without our prior and 63.1% with it (best result in the BMVC’22 paper: 59.37%). With 40% symmetric noise, our method reaches 50.7% without our prior and 53.4% with it (best result in the BMVC’22 paper: 49.65%).

We will add information about the memory footprint of our method. In particular, our memory module is a matrix of dimension N x C. We used debugging tools to analyse our memory module and noted that it requires only 4 MB of GPU memory, and at each iteration we only backpropagate through a small subset of the memory.
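The footprint claim above is easy to sanity-check: an N x C float32 matrix costs N * C * 4 bytes. The sample and class counts below are illustrative values, not the paper's exact dataset sizes.

```python
def memory_footprint_mb(num_samples, num_classes, bytes_per_element=4):
    # Size of an N x C float32 logit memory, in mebibytes
    return num_samples * num_classes * bytes_per_element / (1024 ** 2)

# e.g. ~100k CXR samples with 14 classes fit in a few megabytes
```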


