
Authors

Constantin Ulrich, Fabian Isensee, Tassilo Wald, Maximilian Zenk, Michael Baumgartner, Klaus H. Maier-Hein

Abstract

The medical imaging community generates a wealth of datasets, many of which are openly accessible and annotated for specific diseases and tasks such as multi-organ or lesion segmentation. Current practices continue to limit model training and supervised pre-training to one or a few similar datasets, neglecting the synergistic potential of other available annotated data. We propose MultiTalent, a method that leverages multiple CT datasets with diverse and conflicting class definitions to train a single model for a comprehensive structure segmentation. Our results demonstrate improved segmentation performance compared to previous related approaches, systematically, also compared to single-dataset training using state-of-the-art methods, especially for lesion segmentation and other challenging structures. We show that MultiTalent also represents a powerful foundation model that offers a superior pre-training for various segmentation tasks compared to commonly used supervised or unsupervised pre-training baselines. Our findings offer a new direction for the medical imaging community to effectively utilize the wealth of available data for improved segmentation performance. The code and model weights will be published here: https://github.com/MIC-DKFZ/MultiTalent

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_62

SharedIt: https://rdcu.be/dnwBW

Link to the code repository

https://github.com/MIC-DKFZ/MultiTalent

Link to the dataset(s)

http://medicaldecathlon.com/

https://www.synapse.org/#!Synapse:syn3193805/wiki/217760

https://zenodo.org/record/1169361#.YiDLFnXMJFE

https://structseg2019.grand-challenge.org/

https://competitions.codalab.org/competitions/21145

https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT

https://kits19.grand-challenge.org/

https://amos22.grand-challenge.org/Instructions/

https://zenodo.org/record/6802614


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a new training method with a modified Dice loss function that leverages partially labeled CT datasets with diverse classes, yielding a single model for universal structure segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Simple and useful: the proposed method can handle classes that are absent in one dataset but annotated in another during training, retain different annotation protocol characteristics for the same target structure, and allow for overlapping target structures with different levels of detail.

    2. Flexible and scalable: this method can not only combine multiple datasets during training to generate one model that predicts all classes present in any utilized dataset, but also serve as a pre-trained model for new tasks.

    3. Convincing performance gains: experimental results demonstrated that this method outperformed state-of-the-art segmentation networks trained on each dataset individually.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The loss values for different classes were added rather than averaged. This could lead to large fluctuations (e.g., batch 1 has 13 classes while batch 2 has only one class). Please show the loss curve. The motivation for using BCE+Dice loss is not clear. Why not use focal loss or other SOTA loss functions, e.g., boundary loss or HD loss?

    2. The authors mentioned that “We manually selected a patch size of [96, 192, 192] and image spacing of 1mm in plane and 1.5mm for the axial slice thickness, which nnU-Net used to automatically create the two CNN network topologies.” nnU-Net can automatically generate these fingerprints. What’s the motivation to manually select them?

    3. In Fig. 2, bar plots are not a proper way to show segmentation results. Please use box plots or violin plots.

    4. Table 2: p-values are missing.

    5. Please report the number of network parameters/FLOPs.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This work has great reproducibility since the datasets are publicly available and the method description is clear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Parameter-Efficient Fine-Tuning could be a better approach for transfer learning (see the sketch after these comments): https://github.com/huggingface/peft

    Please address the suggested modifications in the main weaknesses.

    AMOS also provides MR images. It would be great if this method could be extended to multi-modality. That would make it a real foundation model!
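
    Regarding the Parameter-Efficient Fine-Tuning suggestion above, the following is a minimal PyTorch sketch of the simplest such strategy: freezing the pre-trained backbone and training only the segmentation head. All names are hypothetical, not taken from the paper or the peft library; full PEFT methods such as LoRA adapters follow the same requires_grad pattern.

        import torch
        import torch.nn as nn

        def freeze_all_but_head(model: nn.Module, head: nn.Module):
            # Freeze every weight of the pre-trained model ...
            for p in model.parameters():
                p.requires_grad = False
            # ... then re-enable gradients only for the segmentation head.
            for p in head.parameters():
                p.requires_grad = True
            return [p for p in model.parameters() if p.requires_grad]

        # Usage (names hypothetical):
        # trainable = freeze_all_but_head(unet, unet.seg_head)
        # optimizer = torch.optim.SGD(trainable, lr=1e-2, momentum=0.99)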

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presented method is useful for integrating partially labeled datasets. The idea is simple and flexible, and the experiments are comprehensive and convincing.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors have addressed all my concerns. So I raise the score.



Review #2

  • Please describe the contribution of the paper

    The authors propose a new method that leverages multiple CT datasets with diverse and conflicting class definitions to train a single model for comprehensive structure segmentation. The experimental results demonstrate improved segmentation performance compared to previous related approaches as well as to single-dataset training using state-of-the-art methods, especially for lesion segmentation and other challenging structures. The authors show that MultiTalent also represents a powerful foundation model that offers superior pre-training for various segmentation tasks compared to commonly used supervised or unsupervised pre-training baselines. The findings offer a new direction for the medical imaging community to effectively utilize the wealth of available data for improved segmentation performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As summarized in the contribution above, the method trains a single model across multiple CT datasets with diverse and conflicting class definitions, improves over both related approaches and single-dataset training with state-of-the-art methods, and doubles as a strong pre-training foundation. In general, this work is more an engineering application than a standard technical research contribution, but it is very useful for medical image segmentation tasks, and the results are very encouraging and promising. I believe this work can have a high impact and attract much attention in the MICCAI community if the code and pre-trained model are released.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method is quite simple, essentially a modified loss function, which may limit the novelty of this work. However, I still want to point out that a simple method may be more robust and generalizable.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, I believe the paper can be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    If possible, the authors could report statistical results across these methods (multiple-group comparisons, t-tests).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strengths of the paper.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes an approach for multi-class segmentation using an adaptive loss function and a unified model. The proposed approach was evaluated on various datasets with different segmentation tasks, and the results showed significant improvements compared to existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Multi-dataset segmentation using a unified model is a crucial and hot topic in the medical field. The concept of using an adaptive loss function for multiple classes is simple and can be easily replicated. The authors also performed evaluations on multiple datasets and tasks, which is a merit.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The novelty is limited. The main contribution is the modification of the loss function to make it adaptive to multi-class tasks. This training strategy may have been employed in other multi-task segmentation papers, and thus more related works should be included.
    2. The motivation needs to be stated more clearly. It is unclear whether this approach works for partially annotated segmentation. Besides, why do the authors adopt the unsupervised pre-training settings?
    3. The paper lacks detail regarding the experimental settings and the implementation of some compared methods (such as [22], [31], and [32]).
    4. The improvements reported in the paper are subtle, based on Tables 1 & 2. Furthermore, the supplementary material suggests that the Resenc U-Net outperforms the proposed MultiTalent model on many tasks, where the improvement of MultiTalent is slight.
    5. Some descriptions are vague, and there are quite a few textual problems (spelling, grammar).
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Easy to reimplement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. What is the motivation for using pre-training? The statement “MultiTalent offers a superior pretraining for ...” seems inaccurate: in Table 2, training a model from scratch can perform even better than using self-supervised pre-training, and the improvement achieved with MultiTalent is limited.
    2. The terms “single model” and “(pretrained*)” need to be better defined. It is unclear how the compared models were trained for each dataset and why nnU-Net has two models.
    3. Some typos, such as “100.000”.
    4. The authors state that their method has shorter training and test times. It would be better to provide the training time and the number of parameters.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper lacks novelty and related work. The authors also omit some details regarding the experimental settings and compared methods. The improvements are subtle.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Based on the authors’ rebuttal, I agree that the leaderboard results are persuasive. Despite my concern about the novelty of this work, I recognize the value of the published models and code to the community. I am willing to revise my rating to accept the submission.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The proposed method for multi-class segmentation using an adaptive loss function and a unified model demonstrates improved segmentation performance compared to existing methods. The experimental results are comprehensive and convincing, especially for lesion segmentation and challenging structures. The paper highlights the flexibility and scalability of the proposed approach, allowing for the integration of multiple datasets and serving as a foundation model for various segmentation tasks. The clarity and organization of the paper are rated as very good. The reproducibility of the work is also satisfactory. However, the paper has several weaknesses that need to be addressed. Firstly, the loss values for different classes are added instead of averaged, potentially causing training instability. Secondly, the manual selection of certain parameters lacks a clear rationale. The use of bar plots in Figure 2 may not be optimal for visualization. Table 2 is missing p-values, which are crucial for statistical analysis. More details about the network parameters, computational cost, and the implementation of compared methods are needed. Based on the reviews, the meta-reviewer invites this paper for a rebuttal, which should carefully address the above weaknesses.




Author Feedback

We thank all reviewers for their valuable and insightful reviews. First, we will address the concerns raised by reviewer 1 (R1) and the meta-reviewer (MR).

Adding the loss for each annotated class (R1 & MR): Averaging the loss over all annotated classes would scale the gradients for each class by the number of annotated classes in its dataset: the magnitude of the loss, e.g., for the liver head from D1 (2 classes), would be 6 times as big as for D7 (13 classes). Additionally, the magnitude of the loss for each head would also be influenced by the number of classes of the other patches within the batch. Our approach weights each class equally, independently of the dataset, and gradient clipping captures any potential instability that might arise from a higher loss magnitude. Dividing the loss by the total number of classes would instead lead to very small loss values and vanishing gradients. We extended the explanation in the final version.
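
To make the summation concrete, below is a minimal sketch of such a class-adaptive BCE+Dice loss, written from the description above. It is not the authors' implementation, and all names are hypothetical; it assumes one sigmoid output channel per class and a per-sample binary mask marking which classes are annotated in the sample's source dataset.

    import torch
    import torch.nn.functional as F

    def class_adaptive_bce_dice(logits, targets, annotated, eps=1e-5):
        # logits, targets: (B, C, X, Y, Z); annotated: (B, C) with 1 where
        # the class is labeled in the sample's source dataset.
        probs = torch.sigmoid(logits)
        # Per-class BCE, averaged over voxels only.
        bce = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none").mean(dim=(2, 3, 4))
        # Per-class soft Dice.
        inter = (probs * targets).sum(dim=(2, 3, 4))
        denom = probs.sum(dim=(2, 3, 4)) + targets.sum(dim=(2, 3, 4))
        dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
        # Sum over annotated classes (no division by the class count), so each
        # class contributes equally regardless of how many classes its dataset
        # labels; average over the batch only.
        return ((bce + dice) * annotated).sum(dim=1).mean()

    # Gradient clipping, as mentioned above, bounds occasional larger
    # magnitudes (the max_norm value here is hypothetical):
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=12)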

Manual selection of hyperparameters (R1 & MR): In general, we tried not to change too many of nnU-Net's well-established design choices and therefore adopted, e.g., the combination of CE and Dice loss. However, nnU-Net's fingerprint extraction operates on a per-case basis, as it was designed for single datasets. In our setting, the datasets have highly heterogeneous numbers of images, which would cause some datasets to dominate the automatic hyperparameter selection. Therefore, we manually selected, e.g., the spacing configuration, which we compared with the automatically selected nnU-Net default. We added a section to clarify this.
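
For illustration, the fixed configuration could be expressed as an override of the automatically derived plans. The dictionary below is a hypothetical sketch, not nnU-Net's actual plans schema; the values are taken from the paper.

    # Hypothetical override of the dataset fingerprint (values from the paper).
    plans_overrides = {
        "patch_size": (96, 192, 192),       # voxels: (axial, in-plane, in-plane)
        "target_spacing": (1.5, 1.0, 1.0),  # mm: axial slice thickness, then in-plane
    }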

As R1 and MR suggested, we added a significance test to Tab. 2 (e.g., for the BTCV dataset, a U-Net pre-trained with MultiTalent vs. trained from scratch has a p-value of 0.003, and the Resenc U-Net of 0.034). We also changed Fig. 2 to boxplots. Comparisons of individual points across boxplots should be made with care because, e.g., a liver segmentation Dice comes from a completely different distribution than a tumor Dice.
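
The rebuttal does not state which significance test was used; as a sketch, a paired, non-parametric test over per-case Dice scores is one common choice (the data below are toy stand-ins, not the paper's results):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Toy stand-ins for per-case Dice scores (one value per test case).
    dice_scratch = rng.uniform(0.75, 0.90, size=30)
    dice_pretrained = dice_scratch + rng.normal(0.02, 0.01, size=30)

    # Paired Wilcoxon signed-rank test between the two training schemes.
    stat, p = stats.wilcoxon(dice_pretrained, dice_scratch)
    print(f"p-value: {p:.4f}")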

FLOPs & parameter count & implementation details (R1 & MR): We now report the network FLOPs (F) and parameters (P), following the naming of Fig. 2: e.g., MT U-Net: 7.20e+11 F, 2.932e+07 P; for D1 with 2 classes: nnU-Net default: 8.09e+11 F, 3.120e+07 P, and U-Net: 7.15e+11 F, 2.929e+07 P. There are only minimal differences between the counts/FLOPs of MultiTalent and the baselines, originating from the different segmentation heads. We already mentioned the training time and GPU memory requirements (Sections 2.3 and 3). Additionally, all major deviations from nnU-Net are specified in the manuscript, and the code will be published.
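
As a sketch of how such numbers are typically obtained in PyTorch (the exact tooling used by the authors is not stated, and the model below is a stand-in):

    import torch
    import torch.nn as nn

    def count_trainable_params(model: nn.Module) -> int:
        # Parameter count as usually reported: trainable weights only.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Stand-in 3D model; a real run would use the actual network and patch size.
    model = nn.Sequential(nn.Conv3d(1, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv3d(32, 32, 3, padding=1))
    print(count_trainable_params(model))

    # FLOPs can be estimated with a profiler such as fvcore (one option):
    #   from fvcore.nn import FlopCountAnalysis
    #   FlopCountAnalysis(model, torch.zeros(1, 1, 96, 192, 192)).total()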

Moving on to the concerns of R3, who criticized the “subtle” improvements of MultiTalent and its novelty: we would like to highlight that, apart from the significantly higher training and inference efficiency, we provide the first solution that matches or even outperforms SOTA single-dataset methods. In addition, MultiTalent outperforms all related work [22, 31, 32] that was also trained with partially labeled datasets on a public leaderboard. To this end, we would like to emphasize the importance of public leaderboards for a fair evaluation. While we did not reimplement these methods, all the necessary details can be found in the respective publications.

The proposed novel training strategy, MultiTalent, deals with the challenging problem of partially annotated datasets (not to be confused with sparse annotations); therefore, it cannot be compared to “any multi-task segmentation [method]”. We now elaborate on this distinction, as it might have caused confusion. Furthermore, R3 asked why we compare “to the unsupervised pre-training settings”. Our model is uniquely able to be trained simultaneously on a large variety of datasets, resulting in a foundation model that can serve as a starting point for improved learning of new, unseen tasks, which is also the objective of unsupervised pre-training.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a novel training method utilizing a modified Dice loss function to leverage partially labeled CT datasets with diverse classes, resulting in a single model for universal structure segmentation. The experimental results demonstrate improved segmentation performance compared to previous approaches, especially for challenging structures such as lesions. The method exhibits flexibility and scalability, accommodating classes absent in one dataset but annotated in another, while retaining different annotation protocol characteristics. The comprehensive evaluations on various datasets and tasks validate the effectiveness of the proposed approach. The paper was originally required to address several issues, such as refining the loss function calculation, providing clearer explanations and more details, improving visualizations, and reporting p-values and network parameters/FLOPs. The rebuttal addresses these concerns well, and the paper received a consensus to accept.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal is quite thorough and has addressed the reviewers’ concerns regarding the detailed analysis of the loss function, further analysis of the performance, and the efficiency analysis. Overall, this paper has sufficient contributions, and the performance is promising. The final version should include more discussion to highlight the novelty of the paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The weaknesses of this paper concerned questions about the loss and missing information. The authors did a great job addressing the various questions and providing the needed details, as reflected in the reviewers’ new ratings after the rebuttal. Hence, I recommend acceptance of this paper.


