
Authors

Sofie Tilborghs, Jeroen Bertels, David Robben, Dirk Vandermeulen, Frederik Maes

Abstract

Albeit the Dice loss is one of the dominant loss functions in medical image segmentation, most research omits a closer look at its derivative, i.e. the real motor of the optimization when using gradient descent. In this paper, we highlight the peculiar action of the Dice loss in the presence of missing or empty labels. First, we formulate a theoretical basis that gives a general description of the Dice loss and its derivative. It turns out that the choice of the reduction dimensions Phi and the smoothing term epsilon is non-trivial and greatly influences its behavior. We find and propose heuristic combinations of Phi and epsilon that work in a segmentation setting with either missing or empty labels. Second, we empirically validate these findings in a binary and multiclass segmentation setting using two publicly available datasets. We confirm that the choice of Phi and epsilon is indeed pivotal. With Phi chosen such that the reductions happen over a single batch (and class) element and with a negligible epsilon, the Dice loss deals with missing labels naturally and performs similarly compared to recent adaptations specific for missing labels. With Phi chosen such that the reductions happen over multiple batch elements or with a heuristic value for epsilon, the Dice loss handles empty labels correctly. We believe that this work highlights some essential perspectives and hope that it encourages researchers to better describe their exact implementation of the Dice loss in future work.
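
Below is a minimal sketch (in PyTorch; not the authors' exact implementation from the linked repository) of a soft Dice loss in which the reduction dimensions Phi are exposed as a reduce_dims argument and the smoothing term epsilon as eps. The function name and the assumed tensor layout (batch, class, spatial dimensions) are illustrative.

    import torch

    def soft_dice_loss(pred, target, reduce_dims=(2, 3, 4), eps=1e-7):
        # Soft Dice loss with configurable reduction dimensions (the role of Phi).
        # Summing only over the spatial dimensions (2, 3, 4) gives one Dice term
        # per batch element and class; adding dim 0 pools the intersection and
        # cardinality over the whole batch instead.
        intersection = torch.sum(pred * target, dim=reduce_dims)
        cardinality = torch.sum(pred + target, dim=reduce_dims)
        dice = (2.0 * intersection + eps) / (cardinality + eps)
        return 1.0 - dice.mean()

    # Per-image-and-class reduction with a negligible eps, versus pooling the
    # reductions over the batch dimension as well:
    # loss_image = soft_dice_loss(pred, target, reduce_dims=(2, 3, 4))
    # loss_batch = soft_dice_loss(pred, target, reduce_dims=(0, 2, 3, 4))

In this parameterization, the two regimes described in the abstract correspond to reduce_dims=(2, 3, 4) with a negligible eps (missing labels) and either reduce_dims=(0, 2, 3, 4) or a heuristically enlarged eps (empty labels).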

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_51

SharedIt: https://rdcu.be/cVRy5

Link to the code repository

https://github.com/JeroenBertels/dicegrad

Link to the dataset(s)

BRATS: https://www.med.upenn.edu/sbia/brats2018/data.html
ACDC: https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper analyses the commonly used Dice loss in the context of missing or empty labels. It provides a formulation for the loss in terms of reduction dimensions ‘phi’ and smoothing term ‘epsilon’, and demonstrates how their choice influences segmentation results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper provides a formulation that generalizes several loss functions in the context of missing or empty labels.
    2. It demonstrates how to tune the parameters ‘phi’ and ‘epsilon’ based on the problem at hand (correctly handling missing labels or empty labels).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The introduction section is not clear and not helpful for understanding the context of the paper. The topics of missing and empty labels are barely mentioned there.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors stated that they will release the code and the related material for the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The article discusses the Dice loss in the context of missing and empty labels. However, the introduction does not explain the need for this analysis or the relevant use cases for each of them.
    2. The introduction discusses general papers about the Dice loss, but mentions only a handful of works on the topic of missing or empty labels. The relevant papers are cited much later, in the methods section, which makes the article difficult to follow.
    3. In the introduction the reduction dimension “phi” is mentioned without any background or formula to understand what it is referring to. It becomes clearer only later.
    4. DL_CI and DL_BCI are mentioned in section 2.1 but are not used or compared to in the experiments. Their function is therefore not clear.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a significant contribution to the understanding of dice loss in the context of missing and empty labels. It provides a generalization of the loss and shows the impact of the parameters on segmentation performance in this context. However, its organization and clarity are lacking.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors made clarifications about several aspects, which were the weak part of the paper (e.g. the difference between missing and empty labels). If they make the changes indicated in the rebuttal and the paper becomes clearer, I think this work deserves a “strong accept” because of its importance (a widely used segmentation loss).



Review #2

  • Please describe the contribution of the paper

    This paper investigates the underlying mechanism of Dice loss by checking its derivative during model training. Compared to some existing works that study the Dice loss, this paper provides more details about the derivative as well as the reasoning about the choice of $\Phi$ and $\epsilon$. Based on some theoretical analysis, this paper proposes heuristic combinations of $\Phi$ and $\epsilon$ that help train the segmentation network under missing or empty labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This paper provides a theoretical analysis on the derivative of Dice loss, as well as the effects of $\Phi$ and $\epsilon$. Such an analysis provides a deeper understanding of Dice loss. (2) The experimental setups are reasonably designed, and the evaluation results verify the claims made in the paper to some extent.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) I would like to point out that this paper is not easy to follow. Some descriptions in the paper are confusing and unclear. For example, for the BRATS dataset, what are partially and fully labeled data? In addition, the heuristic in Section 2.3 is not easy to understand. For example, what is meant by the sentence “A very simple strategy would be to let …”? (2) In the experiments, it is mentioned that the marginal Dice loss and the leaf Dice loss are compared. However, I did not see the results for these two losses. Do these two losses correspond to certain B values in Table 1? This is confusing. (3) To demonstrate the heuristic regarding the choice of $\epsilon$, this paper sets $\epsilon$ to the expected volume of the ground truth, which turns out to work well. To better illustrate this heuristic, it would be necessary to conduct experiments where other $\epsilon$ values are chosen. This would provide a more effective comparison of the impact of $\epsilon$ values.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper mentions that the code will be released upon acceptance. This would be helpful in reproducing the results. In addition, this paper also offers some training details for reproducing the results. But these details may not be completely sufficient to obtain the experimental results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    (1) The writing of this paper should be improved to make it easier to understand. In particular, it would be better to additionally use plain language to explain the insights and intuition. (2) Additional experimental results are required. For example, it would be helpful to provide results for other $\epsilon$ values. Also, it is necessary to provide the results of the marginal Dice loss and the leaf Dice loss.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Major factors: the paper writing and experimental verification.

    Many descriptions in the paper are not clear, which hinders understanding. For the experimental verification, some additional results are needed to further verify the proposed idea.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    After reading the response from the authors, I think they have addressed some of my (and other reviewers’) concerns. However, I would like to stick to my previous rating for this paper, i.e., weak reject. My concerns for this paper include:

    1. The writing of the paper needs improvement. Although the authors clarified some critical points in the rebuttal (such as the difference between missing and empty labels), overall the paper is still difficult to follow. I hope the authors will further polish the paper so that it is more accessible to readers.
    2. This paper lacks a sufficient introduction to, or discussion of, other relevant works on empty and missing labels.
    3. This paper lacks some experiments, an issue also raised by other reviewers. For example, it lacks a study of the behavior of DL_CI and DL_BCI, and it does not investigate the impact of different values of epsilon on the performance. Note that the authors mentioned they plan to add these experiments in the future, but this is not guaranteed.

    Therefore, in my opinion this paper could be further improved.



Review #3

  • Please describe the contribution of the paper

    The authors study the Dice loss in the frequent context of missing or empty labels. They provide a useful formulation of the loss and clearly present the different possible subsets used for the reduction dimensions (image, batch, class), i.e. dimensions on which to aggregate the intersections and unions. They show the importance of these dimensions and of the smoothing term when dealing with missing or empty labels and propose well motivated heuristics for setting these parameters. Segmentation experiments on two public datasets (BRATS, ACDC) are performed to illustrate the effect of the heuristics, with quantitative and qualitative results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and structured. Missing and empty labels are common problems in medical image segmentation that will benefit from the clear formulations and proposed heuristics. Besides the limitations mentioned in the weaknesses, the experiments are well designed and described, and the results support the hypotheses.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    An experiment with real empty labels would be needed (cases or patches containing only background). Here the results show that the model starts learning when to predict empty labels once phi and epsilon are chosen to do so, but only as a negative result (lower DSC). In the presence of real empty labels, we should see an increase in performance through reduced false positives in the empty maps.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Mention to release the code upon acceptance, public datasets

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    As mentioned above, I would recommend experiments with real empty labels, to show that the approach reduces false positives in cases with only background, e.g. patches that do not contain a tumor, or cases with and without a lesion. Other segmentation losses could be briefly mentioned in the introduction. The main difference between missing and empty labels and its implications should be discussed: should false positives be taken into account in the case of empty labels? This comes later in Sect. 2.2 and 3, but it would be good to mention it explicitly in the introduction. In Sect. 2, I would remove “most general case”, as it is only so in the case of segmentation. The intermediate part of eq. (1) could be removed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The formulation is important. I agree with the authors that it could help other authors to clearly present their Dice loss implementation. The heuristics are well motivated and the results seem promising. However, the experiments are not sufficient to fully illustrate the approach.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Some clarifications were given by the authors. I maintain the accept suggestion for this interesting paper and encourage the authors to extend this work as mentioned.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper analyses the commonly used Dice loss in the context of missing or empty labels. It provides a formulation for the loss in terms of the reduction dimensions ‘phi’ and the smoothing term ‘epsilon’, and demonstrates how their choice influences segmentation results. The reviewers noted the importance of the topic. In their feedback, the authors should discuss the difference between empty and missing labels and provide more motivation/intuition.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We would like to thank the reviewers for the constructive feedback. We addressed their questions and remarks below, including minor textual changes.

[MR1, R1, R3] There is a main comment to clarify the difference between missing and empty labels. Thank you for pointing this out. We will add a more intuitive and motivational explanation in Sect. 1, by changing “…to deal with missing labels in particular in [7] and [15].” to “…to deal with ‘missing’ labels [7,15], i.e. a label that is missing in the ground truth even though it is present in the image.”. We will also change “…Dice loss by taking a closer look at its derivative…” to “…Dice loss, especially in the context of missing and empty labels. In contrast to ‘missing’ labels, ‘empty’ labels are labels that are not present in the image (and hence also not in the ground truth). We will first take a closer look at the derivative…”. This should make the introduction and the rest of the paper easier to follow.

[R1] The reduction dimension ‘phi’ is unclear in the introduction. We would like to add a reference here: “…the reduction dimensions ‘phi’ given (Sect. 2.1).”.

[R1] Papers on missing or empty labels are cited in the methods while only a handful are mentioned in the introduction. Due to space constraints, we had to keep the introduction in Sect. 1 on missing and empty labels short, but we agree we condensed it too much (see first comment).

[R1] DL_CI and DL_BCI (Sect. 2.1) not used in the experiments. We described DL_CI and DL_BCI for completeness and to link our work with methods in literature. Experiments on multiple multiclass datasets to assess the behavior of DL_CI and DL_BCI are planned in future work.

[R2] Experiments with other epsilon values are needed. As indicated by ‘A very simple strategy…’ in Sect. 2.3, our choice for epsilon is not unique. The goal in this work was to demonstrate that an epsilon larger than usual has interesting properties without claiming optimality. Further analyzing different values of epsilon is mentioned in the discussion for future work.
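
To illustrate this point numerically (a hedged sketch with illustrative values, not code or results from the paper): with a per-sample reduction and an empty ground truth, the gradient of the soft Dice loss with respect to the predictions is roughly proportional to eps, so a negligible eps leaves false positives essentially unpenalized, whereas an eps on the order of the expected foreground volume does not.

    import torch

    def dice_loss(pred, target, eps):
        # Per-sample soft Dice loss (illustrative helper, not the paper's code).
        inter = torch.sum(pred * target)
        card = torch.sum(pred + target)
        return 1.0 - (2.0 * inter + eps) / (card + eps)

    target = torch.zeros(1000)                            # empty label
    pred = torch.full((1000,), 0.1, requires_grad=True)   # small false positives

    for eps in (1e-7, 100.0):  # negligible vs. roughly "expected volume" scale
        loss = dice_loss(pred, target, eps)
        grad, = torch.autograd.grad(loss, pred)
        print(f"eps={eps:g}  loss={loss.item():.4f}  "
              f"mean |grad|={grad.abs().mean().item():.1e}")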

[R2] I did not see the results for the marginal and leaf Dice losses. We kindly refer to the end of page 6 for these results. These losses were only applied to the ACDC dataset, since they were designed for multiclass segmentation.

[R2] Reproducibility: “…details may not be completely sufficient…”. We are convinced the details we provide are in line with most other papers, although they might indeed be insufficient for exact replication. We will ensure complete details and clarify this by extending “…we plan to release all the code…” with “…necessary for exact replication of the results upon acceptance, including preprocessing, training scripts, statistical analysis, etc.”

[R2] For BRATS, what are partially and fully labeled data? This information is available under Sect. 3. To further clarify, we will also rephrase “To set up the missing or empty label task, …” to “To construct a partially labeled dataset for the missing and empty label tasks, …”.

[R3] An experiment with real empty labels (no tumor) is needed. Since many benchmark datasets contain no subjects with empty labels (e.g. no tumor), we chose the more difficult discrimination between HGG and LGG tumors. We expect similar observations in the tumor versus no-tumor scenario. We find the idea of investigating false positive predictions in a patch-based setup very interesting. As this is a translation from a dataset-level to a subject-level analysis without large methodological differences, we would like to keep these experiments for an extended version of the paper.

[R3] Other segmentation losses could be briefly mentioned in the introduction. We would like to keep the current introduction on segmentation losses concise, but we will connect with other segmentation losses in an extended version.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Reviewers’ concerns, which mainly related to the lack of clarity, missing details and insufficient experimental validation, were not fully addressed in the rebuttal. It seems that the paper (though presenting an interesting study) needs more work to meet the standard of a MICCAI publication. I urge the authors to thoroughly revise the paper for the camera-ready version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    .

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    While this paper presents an interesting study on the behaviour of the Dice loss, I agree with the general remark that the paper is very difficult to follow and requires important changes to clarify aspects that are not well explained in the current version. These modifications would certainly require another round of reviews for verification. In a conference format, this, however, cannot be guaranteed.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #4

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Both reviewer and AC recommendations on this paper were split with a large divergence. The PCs thus assessed the paper reviews, meta-reviews, the rebuttal, and the submission. It is noted that the reviewers all appreciated the importance of the work and that it is interesting to investigate the underlying mechanism of Dice loss. While clarity was pointed out as a main concern in the review process, the majority of the reviewers found that the authors provided satisfying clarifications to the issues raised. The PCs agreed with the convincing arguments of the supporting reviewers and felt that the weaknesses as pointed out were outweighed by the strengths listed. The final decision of the paper is thus accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


