
Authors

Pengbo Liu, Xia Wang, Mengsi Fan, Hongli Pan, Minmin Yin, Xiaohong Zhu, Dandan Du, Xiaoying Zhao, Li Xiao, Lian Ding, Xingwang Wu, S. Kevin Zhou

Abstract

There exist a large number of datasets for organ segmentation, which are partially annotated and sequentially constructed. A typical dataset is constructed at a certain time by curating medical images and annotating the organs of interest. In other words, new datasets with annotations of new organ categories are built over time. To unleash the potential behind these partially labeled, sequentially constructed datasets, we propose to incrementally learn a multi-organ segmentation model. In each incremental learning (IL) stage, we lose access to the previous data and annotations, whose knowledge is presumably captured by the current model, and gain access to a new dataset with annotations of new organ categories, from which we learn to update the organ segmentation model to include the new organs. While IL is notorious for its ‘catastrophic forgetting’ weakness in the context of natural image analysis, we experimentally discover that such a weakness mostly disappears for CT multi-organ segmentation. To further stabilize the model performance across the IL stages, we introduce a light memory module and loss functions that constrain the representations of different categories in feature space, aggregating feature representations of the same class and separating feature representations of different classes. Extensive experiments on five open-source datasets are conducted to illustrate the effectiveness of our method.
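To make the feature-space idea concrete, below is a minimal sketch of prototype-style aggregation/separation losses built around a light per-class memory. This is an illustration only, not the paper's exact formulation: the names l_same and l_opp echo the loss names mentioned in Review #3, while the EMA update rule, the margin, and the tensor shapes are assumptions.

    import torch
    import torch.nn.functional as F

    num_classes, feat_dim = 5, 64
    # the "light memory": one prototype vector per class
    prototypes = torch.zeros(num_classes, feat_dim)

    def feature_losses(feat, labels, margin=1.0, momentum=0.9):
        # feat: (N, D) voxel features; labels: (N,) class ids (hypothetical shapes)
        l_same = feat.new_zeros(())
        for c in labels.unique():
            fc = feat[labels == c]
            # refresh the memory with an exponential moving average (assumed update rule)
            prototypes[c] = momentum * prototypes[c] + (1 - momentum) * fc.mean(0).detach()
            # aggregate: pull features of class c toward the stored prototype
            l_same = l_same + (fc - prototypes[c]).pow(2).sum(1).mean()
        # separate: push prototypes of different classes at least `margin` apart
        dists = torch.cdist(prototypes, prototypes)
        off_diag = ~torch.eye(num_classes, dtype=torch.bool)
        l_opp = F.relu(margin - dists[off_diag]).mean()
        return l_same, l_opp

In training, l_same and l_opp would simply be added, with some weights, to the segmentation loss of the current IL stage.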

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_68

SharedIt: https://rdcu.be/cVRwV

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a method to perform incremental learning (IL) of segmentation models using datasets with disjoint, potentially non-overlapping annotations. The authors propose to use a “light memory module” to make the location and approximate shape of previously seen anatomies persistent during model training, and a loss function that mitigates the effects of conflicting labels in certain regions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper describes a method that seems to yield interesting results in the scope of incremental learning.

    Particularly interesting are the results in Table 2, where models trained sequentially on different datasets do not forget the information learned in previous steps and yield good results on all the different anatomies even though training for these anatomies has happened in the past.

    We can see that the “ours” row in Table 2 shows good performance on all datasets, with an average performance close to the upper bound. As far as I have understood, the authors have trained MargExcIL in 4 (or 5) separate rounds and obtained these validation-set results using the model obtained after the last round. Models such as FT, LwF, ILT, and MiB were trained in the same way, but since they did not have any IL-specific strategy, they basically fail on the first two tasks, yielding very poor Dice scores.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The clarity of the paper is in my opinion insufficient. Certain sentences are unclear:

    • “…Then the probability of classes not marked in ground truth or pseudo label will not be broken during training.” what does “broken” mean?
    • “These conflicts between prediction and ground truth break the whole training process.” I believe it would be much clearer to state that - for example - “These conflicts between prediction and ground truth are the reason the network “forgets” the knowledge learned in previous steps”

    There are more examples of such unclear sentences in the paper (e.g., “all datasets in the meantime”, which I suspect means “all datasets at the same time”).

    Fixing small mistakes is not as important as fixing the presentation of the paper. The bigger issue is the fact that I honestly could not fully understand the method itself!

    For example, equations (1) and (2) as well as Fig. 2 contain notations and terms that have not been discussed in the text of the paper. The purpose of the two terms q-hat and q-tilde is not clear to me. I suspect there might also be a mistake where t-1 should have been used instead of t.

    For what concerns Table 3, I am puzzled because it seems that huge variations in the HD metric are not associated with any variation of Dice. If the contours mismatch by THAT much, how can the Dice stay more or less stable? In MargExcIL (Ours) I see an HD of 2.30 compared to an HD of 8.10 for MargExcIL (woMem), but exactly the same Dice. The culprit might be extremely small (1- or 2-voxel-sized) misclassifications by the model, especially when not using the memory module in intermediate steps. That could be solved by simple heuristics and morphological operations.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The clarity needs improvement to allow for reproducibility

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I would suggest that the authors give a down-to-earth explanation of everything before writing formulas: what are you trying to do, intuitively? Also, they should try not to leave anything to the imagination of the reader.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clarity of the presentation, which reduces the reproducibility of the method, and the unclear advantage of using the memory module motivate my decision to weakly reject the paper. That said, the work has merit and should be re-submitted once it is sufficiently revised.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors aim to tackle the problem of training on partially labeled datasets using incremental learning. The paper is clearly written and easy to follow. The general idea of combining multiple datasets for multi-organ segmentation drives its novelty. Evaluated on five public datasets and multiple backbone networks, the proposed method demonstrates its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The motivation of using partially labeled datasets for multi-organ model training underpins its novelty. 2) The proposed method is clinically practical and the overall pipeline is designed in a principled way. 3) The results have demonstrated its effectiveness. 4) The paper is clearly written and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I have some concerns, listed as follows: 1) The proposed method seems to learn one organ at a time incrementally. This training process could be tedious and time-consuming if one center contains multiple organ labels.

    2) How to tackle the annotation style difference across different datasets? E.g., Center A would label the parotid’s anterior tip, while Center B would not. The authors might want to discuss this potential labeling issue.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to Section 5 – main weaknesses

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The general motivation drives its novelty. The proposed method has demonstrated its effectiveness. Thus, I would recommend accepting this paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces an incremental learning mechanism to learn a single multi-organ segmentation model from partially labelled, sequentially constructed datasets. To deal with the catastrophic forgetting issue, the developed method includes a light memory module that stabilizes the incremental learning process, as well as new loss functions that constrain the representations of different categories in feature space. The experiments performed on organ segmentation from CT scans using five publicly available datasets reveal the effectiveness of the proposed contributions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Building a single multi-organ segmentation model from partially labelled, sequentially constructed datasets is innovative
    • Incremental learning is applied to multiple organ segmentation for the first time
    • A newly designed light memory module is proposed to further mitigate knowledge forgetting in incremental learning
    • Strong assessment on various datasets using Dice and Hausdorff distance and including comparisons with both existing approaches and ablated versions
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The global loss as well as sub-loss weights should be explicitly defined
    • Both marginal and exclusion losses from Shi et al. could be better described
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code does not appear to be made available but the method can be reproduced using the explanations provided. All the datasets used for the training and validation phases are publicly available. The testing phase relies on 3 datasets, including 2 that are private.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The method is of high interest for the medical image analysis community. The submitted paper is innovative and very well written. The following comments could be taken into account for further improvements.

    Main comments:

    1- In Sect. 2, the multi-teacher single-student knowledge distillation (MS-KD) framework proposed by Feng et al. should be mentioned as related work in the paragraph entitled “MOS with partially labelled datasets”.
    2- $L_{Marg}$ and $L_{Exc}$ from Shi et al. [21] should be explicitly defined and linked to Eq. 1 and 2 (a sketch of the general idea behind these losses is given after the minor comments below).
    3- You should explicitly write the global loss and mention how all the sub-losses are weighted.

    Minor comments:

    4- The last sentence of the paragraph “Framework of IL” (Sect. 3.1) is unclear and should provide the definition of $\Theta_{t}$.
    5- Dice and Hausdorff distance comparisons between the proposed approach and existing incremental learning methods, especially MiB [2], could be confirmed by a statistical analysis through t-tests.
    6- The comparisons with and without the memory module report the same performance in both configurations. Your assumption to explain the instability (varying fields of view) should be confirmed using other datasets.
    7- Be careful with the notation differences between Fig. 3 and Eqs. 4, 5 and 6 for $l_{mem}$, $l_{same}$ and $l_{opp}$.
    8- $b$ (background) should be defined in Sect. 3.1 rather than Sect. 3.2, as it is used in both Eq. 1 and 2.
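    For context on main comment 2: the marginal loss of Shi et al. merges the predicted probabilities of all organs that are unlabeled in the current dataset into the background class, so a (possibly correct) prediction of an unlabeled organ is never penalized, while the exclusion loss additionally penalizes predicting one organ on voxels annotated as a mutually exclusive one. A minimal sketch of the marginal cross-entropy (our reading of [21]; tensor names and shapes are illustrative, not the paper's notation):

        import torch

        def marginal_ce(prob, target, labeled):
            # prob:    (N, C) softmax probabilities, class 0 = background b
            # target:  (N,)   ground-truth ids drawn from {0} plus `labeled`
            # labeled: list of organ ids annotated in the current dataset
            unlabeled = [c for c in range(1, prob.shape[1]) if c not in labeled]
            # fold the probabilities of unlabeled organs into the background
            merged_bg = prob[:, [0] + unlabeled].sum(1, keepdim=True)
            merged = torch.cat([merged_bg, prob[:, labeled]], dim=1)
            # remap targets into the merged label space {0, 1..len(labeled)}
            remap = {c: i + 1 for i, c in enumerate(labeled)}
            t = torch.tensor([remap.get(int(c), 0) for c in target])
            return -torch.log(merged[torch.arange(len(t)), t] + 1e-8).mean()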

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Strong innovation dealing with how to build a single multi-organ segmentation model from partially labelled, sequentially constructed datasets
    • Incremental learning applied for the first time to multiple organ segmentation
  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers agree that this is a valuable work with interesting ideas, tackling an important problem, and with good experiments. Although one reviewer gave a 4, they were generally positive and not absolutely confident in their review. Given the effort and contribution of building a multi-organ dataset, along with the methodological contributions to incrementally learn a segmentation network, I think this work deserves an early accept. Having said that, the authors are encouraged to address the reviewer concerns as much as possible, particularly regarding clarity.

    1. Experiments do not show significance (or, at the very least, error bars or standard deviations). These cannot all fit into the giant table, but providing them at least in the supplementary material, e.g., as box-and-whisker plots, would be very valuable.
    2. Exposition was confusing to R1.
    3. R1 points out some oddities with the HD vs DSC numbers, could authors address this?
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank the reviewers and the meta-reviewer for their positive assessment of our work and their helpful suggestions for improvement. We address the major concerns below.

Reviewer 1

  1. “For what concerns Table 3, I am puzzled because it seems that huge variations in the HD metric are not associated with any variation of Dice. If the contours mismatch by THAT much, how can the Dice stay more or less stable? In MargExcIL (Ours) I see an HD of 2.30 compared to an HD of 8.10 for MargExcIL (woMem), but exactly the same Dice. The culprit might be extremely small (1- or 2-voxel-sized) misclassifications by the model, especially when not using the memory module in intermediate steps. That could be solved by simple heuristics and morphological operations.”

     Response: We used the HD95 metric here, so a large value cannot be explained by only 1- or 2-voxel-sized misclassifications. Such unreasonably large HD95 values indicate false-positive predictions when the model is tested on some hard examples, i.e., the model has poorer generalization ability. The introduced memory module keeps the features learned for each organ more robust, so they generalize better to hard cases. The Dice score is averaged over the other, normal cases, which is why it stays stable.
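     To illustrate this point numerically, here is a minimal sketch (the toy masks are hypothetical, not the paper's data) showing that a 1- or 2-voxel false positive leaves HD95 untouched, whereas a sizable distant false-positive region inflates HD95 while Dice only dips slightly:

        import numpy as np
        from scipy.ndimage import distance_transform_edt

        def dice(a, b):
            # Dice overlap between two boolean masks
            return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

        def hd(a, b, percentile=100):
            # symmetric surface distance at a percentile (100 = classic HD, 95 = HD95):
            # distance from every foreground voxel of one mask to the nearest
            # foreground voxel of the other, pooled over both directions
            d_ab = distance_transform_edt(~b)[a]
            d_ba = distance_transform_edt(~a)[b]
            return np.percentile(np.concatenate([d_ab, d_ba]), percentile)

        gt = np.zeros((64, 64, 64), dtype=bool)
        gt[20:40, 20:40, 20:40] = True            # toy "organ" of 8000 voxels

        pred = gt.copy()
        pred[60, 60, 60] = True                   # a single far-away false positive
        print(dice(gt, pred), hd(gt, pred, 95))   # ~0.9999 and 0.0: HD95 ignores it
        print(hd(gt, pred, 100))                  # ~36: only the classic HD explodes

        pred[50:60, 50:60, 50:60] = True          # a 1000-voxel false-positive blob
        print(dice(gt, pred), hd(gt, pred, 95))   # ~0.94 and ~20: HD95 now reacts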
Reviewer 2

  2. “The proposed method seems to learn one organ at a time incrementally. This training process could be tedious and time-consuming if one center contains multiple organ labels.”

     Response: If one center contains multiple organ labels simultaneously, we can use joint learning methods directly. The IL-based method handles the situation where different organ labels arrive sequentially: we can train on the newly arrived data starting from the previously trained parameters, instead of training on all data from scratch. Moreover, the IL-based method shares knowledge via parameters, which avoids privacy issues.
  3. “How to tackle the annotation style difference across different datasets? E.g., Center A would label the parotid’s anterior tip, while Center B would not.”

     Response: We assume the principle of labeling an organ is the same across centers. If there are style differences between centers, we expect the model to be tuned, to some extent, toward the style of the most recently arrived data.
Reviewer 3

  4. “The testing phase relies on 3 datasets including 2 which are private.”

     Response: The AMOS dataset we used in our experiments has been released as a MICCAI 2022 challenge on the Grand Challenge website.


