
Authors

Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Tom van Sonsbeek, Xiantong Zhen, Dwarikanath Mahapatra, Marcel Worring, Cees G. M. Snoek

Abstract

Deep learning models have shown great effectiveness in recognizing findings in medical images. However, they cannot handle the ever-changing clinical environment, which brings newly annotated medical data from different sources. To exploit these incoming streams of data, such models would benefit greatly from learning sequentially from new samples, without forgetting previously obtained knowledge. In this paper we introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection, by applying existing state-of-the-art continual learning methods. In particular, we consider three continual learning scenarios, namely task and class incremental learning and the newly defined cross-domain incremental learning. Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch, while cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge. We perform a thorough analysis of the performance and examine how the well-known challenges of continual learning, such as catastrophic forgetting, exhibit themselves in this setting. The encouraging results demonstrate that continual learning has a major potential to advance disease classification and to produce a more robust and efficient learning framework for clinical settings. The code repository, data partitions and baseline results for the complete benchmark are publicly available.
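
For orientation only, a minimal sketch (not from the paper) of how such scenarios can partition the benchmark data. The MedMNIST class counts and the classes-per-task value below are assumptions for illustration; the actual splits used in LifeLonger are defined in the code repository.

    # Illustrative sketch of the three continual learning scenarios on MedMNIST.
    # Class counts follow MedMNIST v2 and are treated here as assumptions.
    DATASETS = {
        "BloodMNIST": 8,    # peripheral blood cell types
        "PathMNIST": 9,     # colon pathology tissue types
        "OrganAMNIST": 11,  # abdominal CT organ labels
        "TissueMNIST": 8,   # kidney cortex cell types
    }

    def incremental_tasks(n_classes, classes_per_task=2):
        """Split one dataset's label space into disjoint groups of classes,
        learned one group at a time. Task-incremental learning uses the same
        split but assumes the task identity is known at test time;
        class-incremental learning does not."""
        labels = list(range(n_classes))
        return [labels[i:i + classes_per_task]
                for i in range(0, n_classes, classes_per_task)]

    def cross_domain_tasks():
        """Each dataset (domain) is one task in the stream; the learner sees
        the domains sequentially and must retain performance on earlier ones."""
        return list(DATASETS.keys())

    print(incremental_tasks(DATASETS["BloodMNIST"]))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(cross_domain_tasks())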

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_31

SharedIt: https://rdcu.be/cVRrO

Link to the code repository

https://github.com/mmderakhshani/LifeLonger

Link to the dataset(s)

https://medmnist.com/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a benchmark for different continual learning scenarios (task incremental, class incremental and cross-domain incremental) based on the MedMNIST dataset. They provide the setup of the benchmark and run multiple state-of-the-art continual learning methods as baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors provide an extensive evaluation of five baseline methods for task and class incremental learning.
    • Using a publicly available dataset such as MedMNIST is highly beneficial for a benchmark.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • One of the main novelties of the benchmark, cross-domain incremental learning, is poorly motivated. It is not clear how the knowledge learned on e.g. BloodMNIST could be beneficial for learning on data of e.g. PathMNIST.
    • The evaluation for cross-domain incremental learning is limited, with only three baseline methods compared. Why are the other approaches not used for cross-domain incremental learning?
    • For all evaluations, an upper bound obtained by joint training on all tasks/classes at once is missing. Such an upper bound is relevant for judging the performance of the continual learning methods.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code of the paper will be made publicly available. In the checklist the authors claim that the software framework and runtime/memory footprint statistics are included; however, neither is included in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The authors should provide more insight into how cross-domain incremental learning could be beneficial, e.g. by showing results of an experiment demonstrating that subsequent domains benefit from previously learned knowledge when fine-tuned.

    • Rather than cross-domain incremental learning, a commonly tackled continual learning scenario is domain incremental learning, where the domains are usually more related than in the proposed benchmark. Examples of such work are:
      – Perkonigg, M., et al. “Dynamic memory to alleviate catastrophic forgetting in continual learning with medical imaging.” Nature Communications 12.1 (2021): 1-12.
      – Gonzalez, C., Sakas, G., and Mukhopadhyay, A. “What is Wrong with Continual Learning in Medical Image Segmentation?” arXiv preprint arXiv:2010.11008 (2020).
      – Srivastava, S., et al. “Continual domain incremental learning for chest x-ray classification in low-resource clinical settings” (2021).
      It might be worth including this domain incremental setting in the benchmark.

    • A more detailed discussion of the results could help to gain further insights. For example, the difference in performance of class incremental learning across datasets is large: why does CI learning on TissueMNIST only reach 32.0, while 67.7 is possible on BloodMNIST? Could the authors offer an explanation/intuition for that?

    • The presented benchmark is limited to 2D and relatively small images. It is an important step in defining benchmarks for continual learning in medical imaging settings; however, a 3D version of such a benchmark is needed to truly evaluate the potential of different CL methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While task and class incremental learning are clearly defined and evaluated sufficiently, the motivation and evaluation of cross-domain incremental learning remain unclear. In addition, more discussion of the results in all continual learning scenarios would be needed to accept the paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper presents a benchmark for continual learning algorithms on 4 datasets from the MedMNIST collection, for multi-class classification. The authors evaluate state-of-the-art continual learning algorithms in task-, class- and domain-incremental learning settings. The cross-domain incremental learning setting is newly defined in this paper, where each domain is a different dataset with a different classification task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A benchmark for continual learning algorithms on medical data is needed.
    • The authors compare state-of-the-art continual learning methods on 4 medical classification tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Commonly in domain adaptation, different domains are given by, e.g., data from different hospitals, but the task stays the same. Here, domains are defined as different datasets with different tasks as well, e.g. CT scans for organ classification vs. kidney microscopy images for cell classification.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors plan to publish their code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The definition of the cross-domain learning setting is confusing. Commonly in domain adaptation, different domains are given by, e.g., data from different hospitals, different modalities or scanners, but the task (e.g. classification of blood cells) stays the same. I wonder whether a model that can classify images as different as CT scans and microscopy images is needed.
    • How did the authors define the split into different tasks? Does the order of learning the different tasks/classes make a difference?
    • In Figure 3 it should be ‘…right column indicates the average accuracy….’
    • It would be interesting to add an upper bound, such as joint training on the whole dataset, similar to [1], to the comparison.

    [1] Van de Ven, Gido M., and Andreas S. Tolias. “Three scenarios for continual learning.” arXiv preprint arXiv:1904.07734 (2019).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although a benchmark for continual learning methods for medical image classification is interesting and important for the field, the definition of the cross-domain incremental learning setting is not clinically relevant.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors provided upper bound scores for Tables 2 & 3 as requested. I agree with R3 that a forward transfer metric should be included, which the authors also intend to do. Overall, I think this paper is interesting for the community, and I agree with R3 that it helps to set the path for a more standardized evaluation of continual learning methods.



Review #3

  • Please describe the contribution of the paper

    Similar to the “SplitMNIST” and “PermutedMNIST” benchmarks commonly used in continual learning literature, the authors propose using “MedMNIST” as a simple, computationally inexpensive benchmark dataset for continual disease classification. They evaluate five different methods in three (or four, depending on how these are viewed) settings and report the average accuracy and forgetting scores.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using MedMNIST as a standardized dataset for benchmarking continual learning methods in the clinical domain is a simple yet sensible idea. The data is small enough that experiments would require minimal computational resources, but the results would be likelier to transfer to medical applications than those obtained with datasets such as SplitMNIST or SplitCIFAR.

    • Figure 1 successfully helps illustrate the setting and the different learning scenarios.

    • While many papers have compared EWC, MAS, LwF and iCaRL, the fact that the authors include “End-to-End Incremental Learning” as a bias correction method is a good idea, especially for the class incremental experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Though I understand the advantages of a computationally inexpensive benchmark dataset, MedMNIST includes very unrealistic images with a resolution of 28×28. It is unclear how results in this benchmark would translate to more realistic medical image applications.

    • Considering that the main contribution lies in sharing benchmark results, a major weakness lies in the choice and definition of the metrics. The forgetting metric, which observes the difference “between the highest and lowest accuracy for each task,” does not follow conventions in continual learning literature that quantify forgetting as the difference in performance directly after training the model on a certain task and after continuing training on future tasks (Díaz-Rodríguez et al.); see the notation sketch after the references below. The authors also do not measure forward transfer or include any metric that quantifies loss of model plasticity.

    • For the “cross-domain incremental learning” setting, the “domains” are too different. It is unclear how well this mimics the situation that the authors describe, namely that of “datasets originating from different institutions.” In addition, the authors name the introduction of this scenario as a key contribution, yet previous research has looked at similar settings within medical imaging (Perkonigg et al., Srivastava et al., Memmel et al.), including research that the authors cite (Srivastava et al., Memmel et al.).

    • It would have been preferable to evaluate a pseudo-rehearsal method alongside iCaRL, which can be considered an upper bound in many scenarios.

    References:
    (Díaz-Rodríguez et al.) Díaz-Rodríguez N, Lomonaco V, Filliat D, Maltoni D. Don’t forget, there is more than forgetting: new metrics for Continual Learning. In Workshop on Continual Learning, NeurIPS 2018 (Neural Information Processing Systems), 2018 Dec 7.
    (Perkonigg et al.) Perkonigg M, Hofmanninger J, Herold CJ, Brink JA, Pianykh O, Prosch H, Langs G. Dynamic memory to alleviate catastrophic forgetting in continual learning with medical imaging. Nature Communications. 2021 Sep 28;12(1):1-2.
    (Srivastava et al.) Srivastava S, Yaqub M, Nandakumar K, Ge Z, Mahapatra D. Continual domain incremental learning for chest x-ray classification in low-resource clinical settings. In Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health 2021 Oct 1 (pp. 226-238). Springer, Cham.
    (Memmel et al.) Memmel M, Gonzalez C, Mukhopadhyay A. Adversarial continual learning for multi-domain hippocampal segmentation. In Domain Adaptation and Representation Transfer, and Affordable Healthcare and AI for Resource Diverse Global Health 2021 Oct 1 (pp. 35-45). Springer, Cham.
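
    Notation sketch for the forgetting convention referred to above (the symbols are illustrative, not the paper’s own). With a_{k,j} denoting the accuracy on task j after training up to task k, forgetting of task j after task k and the average forgetting are commonly written as

        f_j^{k} = a_{j,j} - a_{k,j}, \qquad F_k = \frac{1}{k-1} \sum_{j=1}^{k-1} f_j^{k},

    with some works replacing a_{j,j} by \max_{l < k} a_{l,j}. Either way, this measures the drop from the accuracy attained when task j was learned to its accuracy after the most recent task, rather than the gap between the highest and lowest accuracy observed for that task.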

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors fulfill all necessary reproducibility criteria.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • When outlining the contributions, I suggest that the authors focus on the first of the three contributions listed. As previously stated, scenarios similar to the proposed “cross-domain incremental learning” have been explored in the past, so I would not deem this to be a separate contribution. The third point is “We explore task and class incremental learning scenarios of continual learning, to respond well to new labels i.e. diseases for multi-class disease classification.”, and it is unclear how this differs from the first contribution, where the different scenarios are already presented.

    • I suggest that the authors rephrase the definitions for the continual learning scenarios in the abstract, as the definitions in the main text are much more understandable. One could also argue that there are four, not three, scenarios as “cross-domain incremental learning” can be “domain-aware” or “domain-agnostic”.

    • What does “fine-grained cross-domain incremental learning” stand for?

    • The explanations for well-known concepts in continual learning, such as the introduction of catastrophic forgetting, take up too much space. I would suggest that the authors instead focus on the necessity of a unified benchmark dataset to drive forward continual learning research in the medical imaging community.

    • When introducing the continual learning strategies on page 5, I would suggest that the authors combine the subtitles with the text, e.g. instead of “\textbf{Regularization methods} They reduce”, write “\textbf{Regularization methods} reduce”. This would save space and be easier to read.

    • In the definition of the “average accuracy” metric, “t” is used both for the number of tasks up to a certain task and for the index of the last task. I would suggest that the authors use a different symbol for the number of tasks (see the notation sketch after this list of comments).

    • The titles in Table 1 are too close together, making the table difficult to read. I would also suggest that the authors separate Table 2 and Table 3 into different pages.

    • Please explain directly why Table 3 only has 3 rows (excluding the lower baseline and several methods). Also, please explicitly state that Table 3 contains results for the “Cross-domain incremental learning” scenario.

    • When stating “we train the model […] with the option of early stopping in the occurrence of overfitting.”, please explain the early stopping strategy.

    • “by combining regularization term with the classification loss” is missing a “the” or “a” before “regularization.”

    • “toward classes, associated with the most recently learnt task.” I would suggest removing the comma and using “learned”.

    • “clinical practise” –> “clinical practice”
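
    As an illustration of the suggested disambiguation (the symbols are illustrative, not the paper’s): with T the total number of tasks and a_{k,j} the accuracy on task j after training on task k, the average accuracy after task k can be written as

        A_k = \frac{1}{k} \sum_{j=1}^{k} a_{k,j}, \qquad k = 1, \dots, T,

    so that the running index k and the total number of tasks T never share a symbol.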

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While I am unsure how transferable MedMNIST is as a benchmark for continual medical image classification, establishing the three proposed settings as a first/complementary way to evaluate continual learning methods would help standardize evaluation in the medical domain. However, the metrics should follow existing conventions and include forward transfer.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I continue to recommend accepting this paper. I still believe that the transferability of the results on such small 2D images is questionable, and I agree with R1 that the domain incremental learning scenario is not well-represented by the proposed setting of using datasets from different modalities. However, in a field where standardization is lacking such as continual learning for medical imaging, this paper is a valuable contribution. The changes made for the rebuttal (such as including results for the joint learning setting) are also meaningful.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Although the reviewers found merits in the paper, they raise several concerns. Some include the lack of proper motivation and evaluation of cross-domain incremental learning, unclear clinical relevance, and insufficient discussion of the results.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank all reviewers and the AC for their encouragement and constructive feedback. Next, we provide clarifications for all points raised by the reviewers.

*** Motivation and clinical relevance of cross-domain incremental learning (R1, R2, R3) State-of-the-art continual learning solutions impose restrictions on the input space and target space; for example, it is assumed that all tasks have the same target space [7, 8, 16, 25, 32] or that all tasks originate from the same (partial) dataset [7, 8]. To better mimic a future clinical scenario, where a complete diagnosis of a patient is required beyond a single medical modality, our cross-domain incremental learning setting assumes that tasks can come from different medical modalities and datasets, e.g. PathMNIST, TissueMNIST, OrganMNIST. In that sense, a cross-domain incremental learner may act as a ‘general AI practitioner’, able to infer potential diseases from different modalities, which motivates its introduction in our benchmark. To answer R1, while the high-level knowledge from BloodMNIST may not be useful for PathMNIST, the low-level knowledge, such as edges, contours, etc., may prove beneficial during representation learning. Additionally, the ability to aggregate knowledge from data coming from different institutions, without the need to re-train from scratch, provides another benefit for cross-domain incremental learning. We will better clarify the motivation in the introduction, as well as clarify the difference with [16, 25, 32].

*** Evaluations and upper bounds (R1, R2, R3) The reviewers are right: an upper bound will help to better judge the considered CL baselines. We follow R2’s suggestion and provide the multi-task (joint) learning average accuracy for each benchmark as the upper bound. Specifically, for Tables 2 and 3: BloodMNIST: 97.98 (+- 0.18), PathMNIST: 93.52 (+- 1.91), OrganaMNIST: 95.22 (+- 0.37), TissueMNIST: 91.27 (+- 0.87). For Table 4 (cross-domain incremental learning): 93.28 (+- 0.28). To R2: we provide splits according to five different random seeds and report their average performance. To R3: for the forgetting metric we follow [7, 8]; we will add the suggested forward transfer metric. For cross-domain incremental learning we report the best three CL methods based on their performance on task- and class-incremental learning; we agree with R1 that we might as well report all five methods and will update Table 4 accordingly in the camera-ready version. To R1 and R2: we consider a 3D version of the benchmark with high-resolution imagery a proper challenge for future work. Thank you.
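
As an aside for readers of this thread, a minimal sketch of what such a joint-training (multi-task) upper bound could look like, assuming a shared backbone with one classification head per dataset; the architecture, shapes, and toy batches below are illustrative assumptions, not the setup from the paper or its repository.

    # Sketch of a joint (multi-task) upper bound: one shared backbone, one head
    # per MedMNIST dataset, trained on all domains at once. Illustrative only.
    import torch
    import torch.nn as nn

    num_classes = {"BloodMNIST": 8, "PathMNIST": 9, "OrganAMNIST": 11, "TissueMNIST": 8}

    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, 256), nn.ReLU())
    heads = nn.ModuleDict({name: nn.Linear(256, c) for name, c in num_classes.items()})
    params = list(backbone.parameters()) + list(heads.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # One joint optimization step over all domains at once; contrast this with the
    # continual-learning stream, where only the current domain's data is available.
    optimizer.zero_grad()
    for name, c in num_classes.items():
        x = torch.randn(16, 3, 28, 28)    # toy batch standing in for real images
        y = torch.randint(0, c, (16,))    # toy labels
        loss = criterion(heads[name](backbone(x)), y)
        loss.backward()                   # gradients accumulate across domains
    optimizer.step()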

*** Discussion of results (R1) At the request of R1, we will expand the discussion of the results in the camera-ready version. We explain the lower scores on TissueMNIST w.r.t. BloodMNIST by their different medical modalities and label spaces. The lower scores reflect the difficulty of the disease classification task for TissueMNIST due to the fewer discriminative features in the input images compared to BloodMNIST.

We accept the remaining suggestions by the reviewers to improve our presentation and figures and we kindly express our gratitude.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed the reviewers’ concerns, and the post-rebuttal comments/discussions revealed that the reviewers are for the most part satisfied and increased their scores. The paper is on an interesting topic and can benefit the community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a benchmark for continual learning based on MedMNIST and evaluates state-of-the-art continual learning algorithms in task-, class- and domain-incremental learning settings. Originally, the concerns were mainly about the lack of proper motivation and evaluation of the cross-domain incremental learning setting. The authors addressed most concerns from the reviewers in the rebuttal. We recommend acceptance and ask the authors to reflect the rebuttal points in the paper if it is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers had concerns about the lack of proper motivation and evaluation of the cross-domain incremental learning setting. The rebuttal addressed these concerns. The AC recommends accepting this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


