Authors

Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen

Abstract

Deep learning techniques have achieved promising performance for computer-aided diagnosis, which helps alleviate the workload of clinicians. However, due to the scarcity of diseased samples, medical image datasets suffer from an inherent imbalance, which biases diagnostic algorithms toward majority categories. This degrades the diagnostic performance, especially in recognizing rare categories. Existing works formulate this challenge as a long-tailed problem and adopt decoupling strategies to mitigate the effect of the biased classifier. But these works use only the imbalanced dataset to train the encoder, and they resample data to re-train the classifier by discarding samples of head categories, thereby restricting the diagnostic performance. To address these problems, we propose a Multi-view Relation-aware Consistency and Virtual Features Compensation (MRC-VFC) framework for long-tailed medical image classification in two stages. In the first stage, we devise a Multi-view Relation-aware Consistency (MRC) for representation learning, which provides the training of encoders with unbiased guidance in addition to the imbalanced supervision. In the second stage, to produce an impartial classifier, we propose the Virtual Features Compensation (VFC) to recalibrate the classifier by generating massive balanced virtual features. Compared with resampling, VFC compensates the minority classes to optimize an unbiased classifier while preserving complete knowledge of the majority ones. Extensive experiments on two long-tailed public benchmarks confirm that our MRC-VFC framework remarkably outperforms state-of-the-art algorithms.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_2

SharedIt: https://rdcu.be/dnwJk

Link to the code repository

https://github.com/jhonP-Li/MRC_VFC

Link to the dataset(s)

N/A

Reviews

Review #3

Please describe the contribution of the paper

The authors presented a novel framework consisting of Multi-view Relation-aware Consistency (MRC) and Virtual Features Compensation (VFC) for long-tailed medical image classification. The proposed MRC uses a student-teacher framework to encourage consistency between augmented views, facilitating the capture of meaningful semantic information. The VFC generates balanced virtual features for all classes to train the classifier. The authors evaluated the model on two long-tailed dermatology datasets, where it outperforms the state-of-the-art methods.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper’s writing and organization are excellent. The existing literature’s current limitations are well-stated, and solutions are offered. The proposed MRC and VFC are reasonable approaches to these limitations and are novel in improving long-tail classification. The MRC allows the encoder to capture inherent semantic features under different data augmentations, while the VFC generates virtual features under a multivariate Gaussian distribution to build an unbiased feature space. The proposed method is compared with several other methods and demonstrates very positive results.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The approach presented by the authors is novel and has outperformed state-of-the-art methods, as evidenced by the results. However, it is important for the authors to address some of the concerns highlighted in the detailed comments, particularly with respect to discrepancies with the results reported by the previous method [2] and the reporting of central tendency and variance.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors used two publicly available datasets. The authors haven’t released the code, however, the work should be reproducible based on the descriptions. The authors haven’t reported the mean and standard deviation despite the claim in the reproducibility form.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

○ The authors should clearly state the type of accuracy used in the evaluation. Is it the overall accuracy or an average of per-class accuracy? This is important since the authors have used an imbalanced test set dominated by majority classes, and evaluating with overall accuracy on an imbalanced test set does not make sense. Gouabou et al. [1] evaluated ISIC-2019-LT using balanced accuracy.
○ The experimental setup and dataset used by the authors are very similar to those presented in Ju, Lie, et al. [2]. However, my concern is regarding the ISIC-2019-LT table in [2], which also used ResNet18 as in this paper. If the same evaluation metric was used in both cases, why are the results for some of the baselines significantly different from those in [2]?
○ Additionally, it seems that the authors have only reported the results from a single-trial experiment. As shown in [2], multiple trials are important for reporting central tendencies and variance.
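The difference between the two metrics raised in the first comment can be illustrated with a minimal sketch. The test set and the always-predict-the-head-class classifier below are hypothetical and are not taken from the paper; the function names are illustrative only.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical imbalanced test set: 90 head-class and 10 tail-class samples.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts the head class.
y_pred = [0] * 100

acc = overall_accuracy(y_true, y_pred)    # 90 of 100 correct
bacc = balanced_accuracy(y_true, y_pred)  # recall 1.0 on class 0, 0.0 on class 1
```

In this toy case the overall accuracy is 0.9 while the balanced accuracy is 0.5, which is why the choice of metric matters on a test set dominated by majority classes.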

Minor:
○ On page 4, it mentions that “the parameters of the teacher model are updated via an exponential moving average of the student parameters.” Is there a reference?
○ Please consider adding references to “different from existing resampling methods” on page 5.
○ Please consider adding references to the “expectation-maximization algorithm.” on page 5.
○ Also, please add the number of epochs the model was trained for, otherwise, it’s hard to reproduce.
○ Please consider adding an interpretation or reference to the multivariate Gaussian distribution used in Stage 2. It leaves readers confused about why the multivariate Gaussian distribution is adopted to generate virtual features.

Ref:
[1] Foahom Gouabou, Arthur Cartel, et al. “End-to-End Decoupled Training: A Robust Deep Learning Method for Long-Tailed Classification of Dermoscopic Images for Skin Lesion Classification.” Electronics 11.20 (2022): 3275.
[2] Ju, Lie, et al. “Flexible Sampling for Long-Tailed Skin Lesion Classification.” Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part III. Cham: Springer Nature Switzerland, 2022.

○ The approach presented by the authors is novel and intuitive to the reviewer. The current content is well-suited for the conference. In the future, the authors should include other datasets to demonstrate the superiority of this framework. The datasets used here were created using the Pareto distribution. However, for natural datasets, such as retinal disease, histopathology, and chest X-ray, where the class distribution does not follow the Pareto distribution, it is unknown whether the framework still works. Therefore, the utility of the framework needs to be demonstrated on other datasets.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well-structured and well-written. The proposed method is carefully considered and surpasses the current state-of-the-art benchmarks. However, my rating for the paper is not above 5 due to the absence of a sufficient explanation for the discrepancies between some reported baselines and previously published papers that addressed the same problem. Otherwise, it should score 6.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The paper addresses the challenge of unbalanced datasets, known as the long tail problem. To this end, the authors propose a combination of teacher-student feature learning and an oversampled classifier learning step, both with adaptations to address the aforementioned challenge. The approach is evaluated on two datasets, both dealing with skin cancer, and the results show that the proposed method improves on the current state of the art. The experiments are further supported by a small ablation study.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is well done and addresses a relevant challenge in medical imaging. Overall, the current state of the art is well addressed and the contributions of the paper are not overstated. I really enjoyed reading the paper and the proposed solutions can be adapted to other learning situations.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The main weaknesses of the paper are the rather incremental improvements, which may reduce its impact, and the limited evaluation. None of this is really serious, however, and it is still clearly in the realm of a MICCAI paper.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Although it is stated that the code for the method and experiments is available, I did not find this information in the paper. The methods described in the paper seem sufficient to reproduce the results. The reproducibility points given are mostly valid and agree with the reported details, with the exception of the code release.

The following items are checked in the reproducibility check and are not found in the paper: I) The hyperparameter selection procedure II) Number of training runs III) Results with central tendency and variation IV) Run time V) Memory usage (reported as not relevant, but would be relevant)
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The paper is generally well written and the results are in good condition. However, there are a few points that could be addressed to further improve the paper:
1. First, the approach is only evaluated on similar data. It would be interesting to see how it performs on different problems. Not only to understand the performance better, but also to see its strengths and weaknesses more clearly. For example, what happens in a binary label setting? What happens with large/small data sets? I understand that this point may be out of focus for the rebuttal, but I would suggest addressing it in further publications/research.
2. The sampling step (around eq. 4 and eq. 5) reminds me of SMOTE. It would be helpful to discuss the differences between SMOTE and similar approaches more clearly.
3. Please discuss how the hyperparameters were tuned. Were they fixed at the start? Were they optimised by looking at the test set or just using the training/validation set?
4. The order of the references is odd. Why not start with 1,2,3 but with 5,18,19?
5. On the second page, “First, in the first stage” is duplicated and could be simplified.
6. On the second page, last paragraph: “Specifically, in the first stage, ….” This sentence was not a logical consequence of the previous details. I understood it afterwards, but it didn’t help me in the first place.
7. The naming of teacher-student is strange. Why is the teacher model updated by the student? Usually it is the other way round, as the student learns from the teacher.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The work is well written and sound. I found no major flaws in the work, making it a solid contribution. For an excellent contribution I would have expected a more in-depth evaluation or more unexpected results. Therefore, I rate the work as a solid contribution (Accept) with a tendency towards a strong contribution (Strong Accept).
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #1

Please describe the contribution of the paper

This paper proposes a two-stage framework called MRC-VFC to learn an unbiased model from an imbalanced medical dataset. The first stage (MRC) learns an encoder of medical images regardless of the imbalanced sample numbers of the class labels, thanks to a multi-view relation-aware consistency loss. Specifically, this stage incorporates correlation terms in the loss function such that the encoder can capture the semantic information of images from differently augmented images. The second stage (VFC) trains the classifier and fine-tunes the encoder in a way that mitigates the issues caused by imbalanced sample sizes of various class labels. The paper verifies the framework on long-tailed datasets from ISIC and ISIC-Archive.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper focuses on an important category of medical problems, i.e., learning from long-tail distribution data. The proposed methods directly contribute to that field of research.
2. The proposed ideas (e.g., MRC module, fine-tuning encoder and classifier, etc.) are based on good intuitions. The design of the MRC module is similar to the cited work of Jinpeng Li et al., MICCAI 2022. By making use of augmented images, the first stage, which minimizes the difference between the student and teacher networks, can produce good data representations.
3. The evaluation comprehensively shows the benefit of the proposed methods. Each of the two proposed stages can lead to a performance gain in the evaluation with the designed datasets.
4. This paper innovatively combines techniques of i) balancing the class distribution in embedding space with the estimated multivariate Gaussian distributions and ii) alternately (E-M steps) optimizing the encoder and the classifier. This combination seems to be effective based on the evaluation section (see Table 1 and Table 2).
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. While the authors proposed an MRC module to learn robust data representations, the loss function (i.e., L_{stage1}) of the first stage still has a component (i.e., L_{CE}) that uses the data labels, which tend to be biased by the imbalanced class distribution. Thus, it is not clear how the overall loss and the encoder can be affected by the imbalance factor.
2. The inner product (S_c) and outer product (S_b) of z should have the same information regardless of the rank. Is L_{batch} (or L_{channel}) redundant?
3. The distinction between weak and strong augmentations in stage 1 is not clear to me. There are two major concerns regarding the two different augmentations of the medical images used: a) could the dermatological labels change due to the augmentations (e.g., can augmenting the color channel move a sample from a head class to a tail class)? b) how should these two augmentations be designed to produce the expected encoder model? Which augmentations are considered weak (in stage 1)?
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Good. The paper designed a long-tail dataset based on a public dataset. The supplementary material provides more details about the label distributions of various classes. Also, Section 3.2 provides implementation details including setting the parameters. However, the authors didn’t mention the plan of releasing the code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. To avoid random effects, it would be good to show statistics (mean +/- standard deviation) of the results from multiple runs (in table 2).
2. Need more evaluation: The proposed methods would be more convincing if the authors validated the framework on more diverse datasets. It is of interest to see if the proposed method works in other medical (image) classification problems. Also, it would be better if the authors compared the methods with other two-stage methods (e.g., with the cited work Kang et al., Decoupling representation and classifier for long-tailed recognition).
3. Need more explanation of setting parameters such as \lambda’s: While the authors explicitly detailed the parameters (e.g., \lambda’s) in section 3.2, it is not clear how to set these parameters in a slightly different classification problem, e.g., given a different dataset.
4. Small issue that can be quickly fixed: Inconsistent method names associated with citation [11] in sections 3.3 and 3.4 (also in tables 1&2 and figure 2). In section 3.3, the method is referred to as FCD, while section 3.4 refers to the method as FS.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper presents a slight improvement over the existing two-stage approaches that also address long-tail problems. The results show a positive impact of the proposed improvement.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The details of the proposed method should be clearer. The improvements yielded by the proposed method seem to be incremental. The experimental results need a clearer explanation.

Author Feedback

We would like to thank all the reviewers for their positive comments and constructive suggestions. We carefully summarized the concerns of reviewers and gave detailed responses below.

R1Q1: The imbalance impact of L_{CE} in the first stage. A: Our MRC-VFC framework follows the decoupling strategy to train the encoder from abundant supervision and then re-train an unbiased classifier. Considering that the encoder is less likely than the classifier to be affected by imbalance, we adopt the L_{CE} used in the decoupling works. Moreover, the proposed MRC module provides multi-view consistency constraints regarding different data perturbations, which will not be biased by the imbalanced distribution. In the experiments, our MRC-VFC framework addresses the challenge of long-tails in medical imaging well.

R2Q2: The novelty of the VFC module. A: Different from the oversampling-based SMOTE, which linearly mixes up paired neighbors, the VFC module generates virtual features from class-wise Gaussian distributions to compensate for the tail classes. The ablation study in Sections 3.3 and 3.4 further demonstrates the effectiveness of the VFC module.
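A minimal sketch of the idea described in this answer, drawing the same number of virtual features per class from a class-wise Gaussian, is given below. It is not the authors' implementation: the function name `sample_virtual_features`, the empirical mean/covariance estimates, the small ridge term, and the toy feature shapes are all assumptions made for illustration.

```python
import numpy as np

def sample_virtual_features(features, labels, n_per_class, rng):
    """Fit one Gaussian per class and draw n_per_class virtual features from each."""
    virtual_x, virtual_y = [], []
    dim = features.shape[1]
    for c in np.unique(labels):
        class_feats = features[labels == c]
        mean = class_feats.mean(axis=0)
        # Empirical covariance plus a small ridge term (an assumption of this
        # sketch) so that tiny tail classes still give a usable covariance.
        cov = np.cov(class_feats, rowvar=False) + 1e-6 * np.eye(dim)
        virtual_x.append(rng.multivariate_normal(mean, cov, size=n_per_class))
        virtual_y.append(np.full(n_per_class, c))
    return np.concatenate(virtual_x), np.concatenate(virtual_y)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs: 200 head-class and 5 tail-class features, 4-D.
feats = np.concatenate([rng.normal(0.0, 1.0, (200, 4)),
                        rng.normal(3.0, 1.0, (5, 4))])
labels = np.array([0] * 200 + [1] * 5)

# The virtual set is balanced even though the real features are not.
vx, vy = sample_virtual_features(feats, labels, n_per_class=100, rng=rng)
```

Unlike SMOTE-style interpolation between neighboring samples, every class here contributes the same number of sampled features, and no head-class sample has to be discarded to reach balance.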

R1Q2: The difference between batch and channel correlations in Section 2.1. A: The batch correlation {S_b} measures the cross-sample relation, i.e., the correlations among data samples, while the channel correlation {S_c} measures the relation among feature maps.
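The two relations can be sketched as follows for a feature matrix z with one row per sample. This is only an illustration under stated assumptions: the cosine-style normalisation, the shapes, and the random stand-in features are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 8, 16                     # hypothetical batch size and feature dimension
z = rng.normal(size=(B, C))      # stand-in for encoder features of one view

# Row-normalise so S_b holds cosine similarities between samples (assumption).
z_rows = z / np.linalg.norm(z, axis=1, keepdims=True)
S_b = z_rows @ z_rows.T          # shape (B, B): relation among data samples

# Column-normalise so S_c holds cosine similarities between channels (assumption).
z_cols = z / np.linalg.norm(z, axis=0, keepdims=True)
S_c = z_cols.T @ z_cols          # shape (C, C): relation among feature channels
```

The two matrices are indexed differently: S_b by pairs of samples in the batch, S_c by pairs of feature channels, so a consistency term on each constrains a different structure of z.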

R1Q3: Data augmentations. A: We have listed the transformations used in strong and weak augmentations in Section 3.2. The color jitter is implemented in the strong augmentations. However, as the MRC is designed to motivate the model to capture semantic information under different data augmentations, the impact of color change on dermatological diagnosis is not our primary concern.

R2Q7: The definition of the teacher and student model. A: As demonstrated in [*1], utilizing the exponential moving average of the student model produces a more accurate model. In our approach, we feed the weak-augmented data to the teacher model, resulting in more accurate predictions. This teacher model is then used to supervise the student model. Therefore, in our settings, the student model is learning from the teacher model.

[*1] Tarvainen, Antti, and Harri Valpola. “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.” Advances in neural information processing systems 30 (2017).
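The exponential moving average update mentioned in this answer can be sketched in a few lines. Weights are represented as plain lists of floats, and the function name `ema_update` and the decay value are assumptions for illustration, not settings reported by the authors.

```python
def ema_update(teacher, student, decay=0.99):
    """Return teacher weights moved a small step towards the student weights."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Toy weights: the teacher is never trained by gradient descent; it only
# tracks a smoothed copy of the student after each student update.
teacher = [0.0, 0.0]
student = [1.0, -2.0]
teacher = ema_update(teacher, student, decay=0.9)
```

After this single step the teacher weights are approximately [0.1, -0.2]. The student receives the gradient updates and is supervised by the teacher's predictions, while the teacher's weights follow the student's, which is the arrangement the answer describes.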

R1&R3: About citations. A: We will correct the reference of FS in Section 3.4 from [2] to [3]. We will add the references to EMA [1], resampling methods [4], multivariate Gaussian distribution [5], and EM [6] in the camera-ready version.

[2] Li, J., Chen, G., Mao, H., Deng, D., Li, D., Hao, J., Dou, Q., Heng, P.A.: Flat-aware cross-stage distilled framework for imbalanced medical image classification. In: MICCAI. pp. 217–226. Springer (2022)
[3] Ju, L., Wu, Y., Wang, L., Yu, Z., Zhao, X., Wang, X., Bonnington, P., Ge, Z.: Flexible sampling for long-tailed skin lesion classification. In: MICCAI. pp. 462–471. Springer (2022)
[4] Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural networks 106, 249–259 (2018)
[5] Ahrendt, Peter. “The multivariate gaussian probability distribution.” Technical University of Denmark, Tech. Rep (2005): 203.
[*6] Moon, Todd K. “The expectation-maximization algorithm.” IEEE Signal processing magazine 13.6 (1996): 47-60.

R2Q5Q6: About writing. A: We will clarify the statement.

R3Q4: About the number of training epochs. A: We set training epochs as 100 for the first stage and 500 for the second stage. We will add these implementation details in the camera-ready.


Combat Long-tails in Medical Classification with Relation-aware Consistency and Virtual Features Compensation