
Authors

Zifu Wang, Teodora Popordanoska, Jeroen Bertels, Robin Lemmens, Matthew B. Blaschko

Abstract

The soft Dice loss (SDL) has taken a pivotal role in numerous automated segmentation pipelines in the medical imaging community. Over the last years, some reasons behind its superior functioning have been uncovered and further optimizations have been explored. However, there is currently no implementation that supports its direct utilization in scenarios involving soft labels. Hence, a synergy between the use of SDL and research leveraging the use of soft labels, also in the context of model calibration, is still missing. In this work, we introduce Dice semimetric losses (DMLs), which (i) are by design identical to SDL in a standard setting with hard labels, but (ii) can be employed in settings with soft labels. Our experiments on the public QUBIQ, LiTS and KiTS benchmarks confirm the potential synergy of DMLs with soft labels (e.g., averaging, label smoothing, and knowledge distillation) over hard labels (e.g., majority voting and random selection). As a result, we obtain superior Dice scores and model calibration, which supports the wider adoption of DMLs in practice. The code is available at https://github.com/zifuwanggg/JDTLosses.
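
As a brief, hypothetical illustration of where such soft labels can come from (not taken from the paper; the function names and the smoothing constant are assumptions for illustration only):

```python
import torch

def soft_label_from_raters(rater_masks: torch.Tensor) -> torch.Tensor:
    """Average several binary rater masks of shape (num_raters, H, W) into a soft label in [0, 1]."""
    return rater_masks.float().mean(dim=0)

def smooth_binary_label(hard_mask: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Binary label smoothing: pull hard {0, 1} labels toward 0.5 by a factor eps (assumed value)."""
    return hard_mask.float() * (1.0 - eps) + 0.5 * eps
```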

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_46

SharedIt: https://rdcu.be/dnwBG

Link to the code repository

https://github.com/zifuwanggg/JDTLosses

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper presents novel Dice Semi-metric Losses (DSLs) that are suitable for handling soft labels, as opposed to the existing Generalised Dice Loss, whose predictions are pushed toward vertices. The losses are validated on the public QUBIQ, LiTS and KiTS datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • In general, the paper is well written.
    • The paper provides novel Dice semi-metric losses
    • The paper provides a theoretical justification for the losses
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method validation has some flaws:

    • The results comparison for the QUBIQ dataset (Table 2) is confounded with weighting schemes, which makes it difficult to understand the contribution of the new losses. See details in the comments section.
    • There is no comparison of results with other methods on the LiTS and KiTS datasets.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that they will provide the training and evaluation code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Review Comments:

    • The results comparison for the QUBIQ dataset (Table 2) is confounded with weighting schemes, which makes it difficult to understand the contribution of the new losses. In particular, weighting by each rater’s annotation Dice score was applied for almost all structures except Brain tumor T2, where average weighting was used, probably because average weighting reached the best BDice for that structure. It is better to evaluate each weighting method separately.
    • The SoftSeg method is also influenced by the annotator weighting scheme; it would be beneficial to check different weighting schemes for the SoftSeg method as well and compare them with the proposed method.
    • Table 1: For clarity, I suggest including the meaning of each column in the table caption. Also, I think it is better to switch the “Weighted” and “LS” columns to match the order in which these methods are described in the text.
    • Except in Table 4, it is not clear which DSL was used for the dataset evaluations (Tables 2 and 3): was it DSL1 or DSL2?
    • The KDE method presented in Section 3.5 is not thoroughly validated; it is not compared to other methods such as temperature scaling. Therefore, it is not necessarily superior, and I suggest removing the sentence “We then adopt it as a post-hoc calibration method to replace the temperature scaling to calibrate the teacher in order to improve the performance of the student.”
    • In Section 3.6, the statement “We find models trained with SDL can still benefit from soft labels to a certain extent, this is because…” is too confident, and the given explanations do not have clear evidence. I suggest replacing it with the phrase “It may be because…”.
    • In Section 3.6, there is a grammatical error: “although SDL push” should be “although SDL pushes”.
    • In Section 3.6, in the sentence “SDL is significantly outperformed by DSLs”, please remove the word “significantly”, as significance tests were not performed.
    • In the supplementary material (Table 8), the KDE method appears sensitive to the bandwidth parameter; this should be added as a limitation when the KDE method is discussed. Since the gains from the KDE method (Table 5) are not very large, I am not sure it adds much value to the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes two new Dice losses that can handle soft labels and provides a theoretical justification for them. Evidence shows that they are superior to hard labels, and from Table 4 they appear superior to the SDL loss. However, the method validation has some issues.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces Dice semimetric losses by replacing the set notation in the loss definition with vector functions and the L1 norm, so that the resulting function satisfies the definition of a semimetric. The proposed losses can handle soft labels; to highlight this, the authors run experiments on data with multiple annotations as well as experiments with knowledge distillation.
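
    For concreteness, a minimal sketch of this construction (an illustration under the stated assumption, not the authors' implementation; see the JDTLosses repository for the reference code). It uses the identity |x ∩ y| = (||x||_1 + ||y||_1 - ||x - y||_1) / 2, which is exact for binary labels and stays well defined for soft labels:

```python
import torch

def dice_semimetric_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """pred, target: per-class probabilities in [0, 1], flattened to shape (batch, num_pixels)."""
    p = pred.sum(dim=1)                       # ||x||_1 (entries are non-negative)
    t = target.sum(dim=1)                     # ||y||_1
    diff = (pred - target).abs().sum(dim=1)   # ||x - y||_1
    dice = (p + t - diff) / (p + t + eps)     # equals 2|x ∩ y| / (|x| + |y|) for hard labels
    return (1.0 - dice).mean()                # vanishes (up to eps) when pred == target, even for soft targets
```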

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • addressing the problem of soft labels in the Dice loss, which is novel;
    • clear mathematical notation and proofs of the propositions; the paper is logical and easy to follow and has all the necessary proofs to support the claimed contributions;
    • evaluation on multiple datasets with relevant metrics.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The implementation details and results do not communicate which DSL was used in Tables 1-3; from Section 3.6 we can guess that it is DSL1. In Section 3.2, the sentence “We leverage a mixture of CE and DSLs” does not communicate which DSL (or both) is used.
    • Lack of standard deviations or statistical tests in the results. The differences between numbers are often on the order of 10^-3, given that Dice lies in the range [0, 1], and thus may not be significant. It would be more informative to report the standard deviation alongside the mean.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed losses are easy to implement. The results can be verified independently.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see the main weaknesses section. It would be beneficial for the paper to include statistical testing of the results, as well as comparisons to the focal loss, focal Dice loss, Lovasz loss, and JSL losses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clear notation, good methodology, and a wide range of experiments.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The authors present a generalization of the popular “Dice Loss” used for training semantic segmentation networks, such that it can now be applied to supervision with soft labels, which might arise from datasets with multiple annotations per case, or in the knowledge distillation setting. Results are presented on three publicly-available datasets (QUBIQ, LiTS, and KiTS) and superior performance is reported when using the proposed method, both in the context of aggregating labels and in knowledge distillation. An interesting secondary result is a replication of previously reported results that a knowledge-distilled student can outperform its teacher.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents a novel formulation for a loss function in the context of soft labels, and provides a solid theoretical foundation for it.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Since the authors made use of challenge data for this paper, I would have liked to see a more direct comparison between their proposed method and the results of top methods on the publicly available leaderboards for these tasks, rather than just testing against their own baselines. I believe their argument is nonetheless convincing, but it would be strengthened if this additional context had been provided.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this paper is excellent. All experiments were performed on publicly-available data and the source code has been made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It looks like the KiTS19 data was used for this paper as a dataset with a single annotation per case. I believe the newer KiTS21 dataset has multiple annotations per case and could therefore have been used in the same way that QUBIQ was for inter-rater experiments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is theoretically novel and technically sound. I believe soft labels are poised to play an increasingly important role in semantic segmentation moving forward and this loss function will be very useful to the community.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper describes an extension of the Dice loss that can naturally incorporate soft labels. There are theoretical arguments backing the effectiveness of this loss, as well as performance increases on three well-known datasets, although two reviewers noted that differences are relatively small and could be statistically non-significant.

    All in all, every reviewer recommended acceptance of this work, and I do not oppose this decision.




Author Feedback

We wish to express our sincere gratitude for your time and effort in reviewing our paper. We greatly value your insightful feedback and constructive suggestions, and are committed to revising our paper based on these comments, including improvements in writing and experiments.

In the camera-ready version, we will (i) add the results of statistical tests, and (ii) clarify that all experiments are conducted using DSL1. In particular, we performed statistical tests using the non-parametric bootstrapping method as in the BraTS Challenge, and we confirm that all our results are significant with p < 0.05.
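
For reference, a hedged sketch of a paired, non-parametric bootstrap test of the kind mentioned above (the resampling scheme, replicate count, and function name are assumptions; the exact protocol used for the paper may differ):

```python
import numpy as np

def paired_bootstrap_p_value(scores_a: np.ndarray, scores_b: np.ndarray,
                             n_boot: int = 10000, seed: int = 0) -> float:
    """Per-case Dice scores of methods A and B; one-sided test that A > B.
    Returns the fraction of bootstrap replicates whose mean difference is <= 0."""
    rng = np.random.default_rng(seed)
    diffs = scores_a - scores_b
    n = len(diffs)
    boot_means = np.array([diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return float((boot_means <= 0).mean())
```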

While we are not permitted to incorporate new datasets or substantially modify the content of this paper at this stage, we plan to delve into these suggestions further in the subsequent journal extension of this paper. For instance, we intend to test our methods on KiTS21 with multiple annotations (Reviewer #1).

We address specific concerns from each reviewer below:

Reviewer #1:

(#1-1) Comparison with leaderboard results. The top-ranking method on QUBIQ20 achieved 77.78% (on the test set), while ours achieves 78.14% (5-fold average). However, since the test set of QUBIQ20 is no longer publicly accessible, these numbers are not directly comparable. Nevertheless, we have compared our methods with SOTA methods on QUBIQ20. Regarding LiTS and KiTS, as our experiments involve KD, which typically employs smaller models, we refrained from comparing with leaderboard results.

Reviewer #2:

(#2-1) Comparison with other loss functions. The Lovasz-softmax loss is mathematically undefined with soft labels, as shown in the JML paper. In our paper, SDL/DSL is optimized in combination with CE, but they can also be combined with the focal loss. We added experiments comparing Lovasz-softmax, focal-SDL and focal-DSL1 on QUBIQ; their BDice (%) are 73.05, 74.32 and 76.81, respectively, confirming the superiority of our loss.

Reviewer #3:

(#3-1) Confounding of the weighting scheme and loss functions in Table 2. Our proposed weighting scheme, like our loss function, is a novel contribution of this paper. To the best of our knowledge, weighted averaging has not been explored in previous works, but we have found that it can improve upon simple uniform weighting. SOTA methods typically employ a mixture of loss functions, architectural choices, weighting schemes, and training tricks. Therefore, to compare our method with SOTA methods, we have included all of our innovations in the comparison. Moreover, even when using uniform weighting for all QUBIQ tasks, we still achieve 77% BDice, outperforming other SOTA methods.
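
A hypothetical sketch of such a weighted-averaging scheme (the exact scheme used in the paper is not reproduced here; the idea of weighting each rater by its Dice agreement with the plain average is an assumption for illustration):

```python
import torch

def dice_coefficient(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    inter = (a * b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def weighted_soft_label(rater_masks: torch.Tensor) -> torch.Tensor:
    """rater_masks: (num_raters, H, W) binary masks; weight each rater by its Dice agreement
    with the plain average of all raters, then re-average with those weights."""
    mean_label = rater_masks.float().mean(dim=0)
    weights = torch.stack([dice_coefficient(m.float(), mean_label) for m in rater_masks])
    weights = weights / weights.sum()
    return (weights.view(-1, 1, 1) * rater_masks.float()).sum(dim=0)
```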

(#3-2) Comparison with other KD methods. We compare our method with MasKD [Huang et al., ICLR 2023]. Using a UNet-R18 student, the performance of their method vs. ours is 58.32 vs. 60.11 (LiTS) and 68.40 vs. 69.73 (KiTS), respectively.

(#3-3) Comparison with TS. While KD in classification tasks is commonly used with a high temperature (>5), many segmentation KD methods [DIST: Huang et al., NeurIPS 2022; CIRKD: Yang et al., CVPR 2022; MasKD: Huang et al., ICLR 2023], including ours, have found that T=1 yields the best result. For instance, the KD results of UNet-R18 on LiTS are 60.11 (ours, KDE), 59.31 (T=1, w/o KDE), 59.10 (T=2, w/o KDE), and 58.25 (T=5, w/o KDE). The ineffectiveness of TS for calibration in semantic segmentation has also been investigated in [LTS: Ding et al., ICCV 2021], which motivates us to tailor a KDE method specifically for segmentation.
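
For context, a minimal sketch of standard temperature-scaled distillation on per-pixel logits (a generic formulation, not the paper's specific KD pipeline); with T=1 it reduces to distilling the teacher's plain softmax:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions.
    Logits have shape (batch, num_classes, H, W); the class dimension is dim=1."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```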

(#3-4) Sensitivity of KDE to the bandwidth. We admit that this is a limitation of KDE. However, we find that the optimal bandwidth for both LiTS and KiTS is identical (5e-4), and the performance is somewhat monotonic near the optimal value. Thus, we believe this can potentially ease the burden of generalizing KDE to other datasets.


