
Authors

Balamurali Murugesan, Sukesh Adiga Vasudeva, Bingyuan Liu, Herve Lombaert, Ismail Ben Ayed, Jose Dolz

Abstract

Ensuring reliable confidence scores from deep networks is of pivotal importance in critical decision-making systems, notably in the medical domain. While recent literature on calibrating deep segmentation networks has led to significant progress, their uncertainty is usually modeled by leveraging the information of individual pixels, which disregards the local structure of the object of interest. In particular, only the recent Spatially Varying Label Smoothing (SVLS) approach addresses this issue by softening the pixel label assignments with a discrete spatial Gaussian kernel. In this work, we first present a constrained optimization perspective of SVLS and demonstrate that it enforces an implicit constraint on soft class proportions of surrounding pixels. Furthermore, our analysis shows that SVLS lacks a mechanism to balance the contribution of the constraint with the primary objective, potentially hindering the optimization process. Based on these observations, we propose a principled and simple solution based on equality constraints on the logit values, which enables explicit control of both the enforced constraint and the weight of the penalty, offering more flexibility. Comprehensive experiments on a variety of well-known segmentation benchmarks demonstrate the superior performance of the proposed approach. The code is available at https://github.com/Bala93/MarginLoss
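To make the idea above concrete, the following is a minimal PyTorch sketch of one plausible instantiation of such a logit-level penalty: a linear (ReLU-based) term that pulls each pixel's logits toward the soft class proportions of its labelled neighbourhood, added to the usual cross-entropy loss. The helper names, the uniform box-average prior, and the default weight lam are illustrative assumptions rather than the authors' exact implementation; the reference code is in the linked repository.

    import torch
    import torch.nn.functional as F

    def neighbourhood_prior(labels, num_classes, kernel_size=3):
        # Soft class proportions of the pixels surrounding each position (cf. SVLS).
        # labels: (B, H, W) integer ground-truth mask.
        onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()  # (B, C, H, W)
        pad = kernel_size // 2
        onehot = F.pad(onehot, (pad, pad, pad, pad), mode='replicate')
        return F.avg_pool2d(onehot, kernel_size, stride=1)                   # (B, C, H, W)

    def linear_logit_penalty(logits, prior, lam=0.1):
        # Linear (ReLU-based) penalty on the equality constraint logits == prior:
        # ReLU(x) + ReLU(-x) == |x|, weighted by an explicit, tunable lam.
        diff = logits - prior
        return lam * (torch.relu(diff) + torch.relu(-diff)).mean()

    # Hypothetical usage inside a training loop:
    # loss = F.cross_entropy(logits, labels) \
    #        + linear_logit_penalty(logits, neighbourhood_prior(labels, logits.shape[1]))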

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_55

SharedIt: https://rdcu.be/dnwBP

Link to the code repository

https://github.com/Bala93/MarginLoss

Link to the dataset(s)

https://www.med.upenn.edu/cbica/brats2020/data.html

https://flare.grand-challenge.org/

https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors demonstrate that the implicit constraint imposed by SVLS is not tunable (it cannot be balanced in relation to the primary objective function). The authors propose an adjustable yet simple linear penalty (based on ReLU) to deal with this issue. The authors show on multiple tasks (namely abdominal organ segmentation, cardiac multi-structure segmentation, and brain tumour segmentation) that their method can outperform state-of-the-art calibration models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-constructed, the key claims are clear and backed — the paper as a whole is well-thought out.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Insights into parameter tuning are lacking.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Ok

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Insights into parameter tuning are lacking. The authors should include a table/figure with results with different values for lambda. What happens if it is too low or high? How should we adjust this parameter?
    • The authors should include visuals to illustrate some of the key problems of the method. According to ‘3-Patch size’, the size of the patches compromises the performance of the method. The explanation of why this is the case is shallow and requires careful study. What is actually the issue with increasing the patch size? Where is the problem, segmentation-wise? Does it occur more around tissue interfaces?
    • The authors should consider including visual examples of best and worst cases.
    • Could the authors please report central tendency and dispersion measurements (e.g. mean and standard deviation) in Table 1, 2, 3? It is quite difficult to judge whether there is a ‘significant’ improvement over other methods.
    • Eq 2. Use \left( and \right)?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clear, interesting, well constructed. See additional comments above.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    No additional comment from my side. To be fair, I respectfully disagree with a few of the comments made by R4, as they might be relevant in competition-type scenarios (where the best performance is the one that matters) but irrelevant for showing that the proposal makes sense.



Review #3

  • Please describe the contribution of the paper

    The authors explore a flexible alternative to the recent SVLS method for model calibration in semantic segmentation tasks by imposing equality constraints on the logit values. Additionally, they analyse the impact on training from an optimisation perspective.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is good to see the analysis of an existing method and its pitfalls, as well as the effort to understand why these occur and to find a solution. I am also pleased to see the authors analysing the optimisation technique, as we often continue to build on existing work without reporting the previous results on our own data. Overall, a nice paper that covers not only calibration but also optimisation.

    Good explanation of the methods and the reasoning for the equality constraints; because of the change in the constraint, it is a novel approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I would like to see a grid-search strategy or an alternative approach to finding optimal hyperparameters. Showing the values utilised in, or similar to, other papers is good, but one can miss other combinations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good, the code is supplied and additional materials.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper is very important within our field and I appreciate the work. My main comment is, as above, on hyperparameters; also, is it possible to repeat the experiments at least three times and report an average value over these runs?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I feel the work is of good quality with minor points that need addressing.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Happy with the feedback given to my concerns, and satisfied, given the limited time available, that the authors repeated the experiments over 3 seeds and referenced the hyperparameter choice to a previous paper [18].



Review #4

  • Please describe the contribution of the paper

    This paper presents a constrained-optimization perspective of Spatially Varying Label Smoothing to calibrate segmentation model predictions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method has sound motivation and clear mathematical formulation.

    2. The writing is easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experiments do not match the real-world setting. For example, the CT images are cropped to 192×192×30. There is no such prior for cropping the images in practice; the algorithms should deal with the original whole images. Following the old settings is not a good reason.

    2. The backbone network is too weak. For the ACDC dataset, the well-known nnUNet can easily obtain 90+ in terms of Dice, but the reported Dice is 0.828 with the CE+DSC loss.

    3. All the experiments were based on 2D networks. It is no longer common practice to segment 3D images using a 2D network in 2023. The authors may argue that they followed existing settings [18]. Again, this is not a good reason. Most of the compared methods are open-sourced; it is not difficult to incorporate them into a 3D network, e.g., nnU-Net.

    4. As mentioned in supplementary Table 1, the datasets used have multiple classes. Please report the performance for each class, since readers are interested in the performance on challenging targets, such as the myocardium in ACDC and the pancreas in FLARE.

    5. Visualized segmentation examples are missing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is great since the author provided the core implementation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please use a strong 3D network (e.g., nnUNet) as the backbone, and it would be great to test the methods on challenging datasets, especially for tumor segmentation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The reported improvements were built on a weak baseline. An open-sourced solution can easily obtain significantly better performance without the proposed method; for example: https://www.nature.com/articles/s41592-020-01008-z

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Thanks for the great rebuttal. I understand that this work aims to improve the calibration performance of segmentation networks, and I did not require that this method set a new SOTA. My major concern is the evaluation.

    1. The paper aims to improve calibration but still uses the DSC score as one of the main metrics. However, the naive nnUNet can already obtain a better DSC score without any tricks. I do not understand why the authors prefer a weak baseline rather than using this open-sourced, user-friendly method, published in a highly reputable journal (Nature Methods), as the baseline to validate the improvements in calibration.

    2. The authors mention that the idea should also work for 3D images; why not conduct such experiments? The authors listed some references that use 2D networks for 3D images. However, looking at the whole spectrum of segmentation papers in MedIA or TMI, it is a more common choice to validate an idea with 3D networks for 3D medical images.

    Although the experiments are weak in terms of the baseline network choice, the idea is well motivated and the rebuttal is clear. Thus, I increase my score. Nevertheless, I highly recommend the authors include additional results using well-recognized baseline methods in future extensions.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper starts from the observation that Spatially Varying Label Smoothing, a technique useful for calibrating segmentation models, is a “rigid” type of regularization (it cannot be tuned per class), and then proceeds to fix this aspect by adding a linear penalty to the loss function. R2 and R3 liked the paper (R2: “The paper is well-constructed, the key claims are clear and backed — the paper as a whole is well-thought out”; R3: “Overall a nice paper that not only covers calibration but also optimisation. Good explanation of methods.”) and were in favor of acceptance as-is.

    R4 also appreciated that this is a high-quality paper (“The method has sound motivation and clear mathematical formulation”), but noticed some weaknesses related to the experimental setting. Namely, R4 stresses that the experimental setup is not aligned with current practice: unjustified crops for preprocessing the FLARE dataset, oddly low performance on ACDC, and 2D segmentation networks instead of 3D ones.

    After going through the paper myself, I have a couple of comments: a) I agree with R4 that the cropping of FLARE data to 192x192x30 should be justified, maybe by referring to some other well-accepted work that does this; and b) I tend to disagree with R4 that “It’s no longer a common practice to segment 3D images using 2D network in 2023”. As far as I am aware, nnU-Net advises using 2D networks when there is strong anisotropy and the slice thickness is too large to provide meaningful 3D context, as is the case in the ACDC dataset.

    Anyway, given that R4 is in favor of rejection, I will recommend that the paper go through the rebuttal stage, and let the authors try to convince R4 that their main contribution, the regularization scheme to improve calibration, remains significant despite the possibly sub-optimal architecture and training decisions.

    If space allows, please also address the minor concerns by R2 and R3 around statistical significance (“Could the authors please report central tendency and dispersion measurements”; “is it possible to repeat experiments atleast three times and get an average value over these runs?”) and hyper-parameter selection (“include a table/figure with results with different values for lambda. What happens if it is too low or high?”; “I would like to see a grid search strategy or alternate approach to finding optimal hyper parameters.”), as well as report per-class metrics so that performance on challenging categories (like the myocardium in ACDC or the pancreas in FLARE) can be better understood.




Author Feedback

We thank the reviewers for the insightful comments. We are pleased that they pointed to the high importance of this work in our field (R3), its novelty (R3) with the key claims properly backed (R2), a sound motivation and clear mathematical formulation (R4), as well as its clarity (R2,R4) and quality (R3). Below, we clarify the main issues raised.

R4: Justification of the score is based on the discrimination performance of UNet with our method compared to nnUNet.

We remind R4 that the aim of the proposed approach is to improve the calibration performance of segmentation networks, and we make no claim that the proposed approach provides a new state of the art from a discrimination standpoint. Furthermore, we have included results with nnUNet (please see the next point), where we show that our approach outperforms a current calibration state-of-the-art method (MbLS) when both use nnUNet as the backbone. Thus, we kindly request R4 to reconsider the score, based on the facts that 1) we do not claim a novel method that improves segmentation performance, and 2) as suggested, we show that our method is model-agnostic and can therefore be used with any backbone, improving its calibration performance over existing approaches.

R4: Weak backbone.

Even though recent networks may achieve better results, we use the standard UNet as a proof of concept and for validation. Our method is model-agnostic and can be used with any other network, including nnUNet. Due to time constraints, we repeated the experiments with nnUNet only against MbLS (the second-best-performing approach); the results, reported as [DSC, ECE], are:

MbLS | Ours
FLARE: [0.891, 0.031] | [0.896, 0.025]
BraTS: [0.865, 0.095] | [0.870, 0.080]
ACDC: [0.886, 0.057] | [0.884, 0.056]

This shows that the improvement gains, particularly in calibration, are maintained regardless of the backbone.
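For readers less familiar with the [DSC, ECE] pairs above, the ECE values refer to the standard expected calibration error. Below is a minimal pixel-level sketch; the choice of 10 equal-width confidence bins and the absence of any foreground masking are illustrative assumptions and may differ from the exact protocol used in the paper.

    import torch

    def expected_calibration_error(probs, labels, n_bins=10):
        # probs: (N, C) softmax probabilities over N pixels; labels: (N,) ground truth.
        conf, pred = probs.max(dim=1)              # per-pixel confidence and predicted class
        correct = pred.eq(labels).float()
        edges = torch.linspace(0.0, 1.0, n_bins + 1)
        ece = torch.zeros(())
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (conf > lo) & (conf <= hi)
            if in_bin.any():
                # gap between mean accuracy and mean confidence, weighted by the bin's share of pixels
                ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
        return ece.item()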

R4: Experiments do not match real-world settings.

We respectfully disagree with this statement, which is indeed misleading. This concern is based only on the cropping of FLARE, which is a common practice in many datasets. For example, SVLS [7] also crops all the datasets used in their empirical validation, including BraTS. Furthermore, for a fair comparison we used the same setting as in [18], a recent MedIA’23 paper. We do not understand why using accepted settings in a highly reputable journal (MedIA) or a flagship medical image computing conference (IPMI) is not sufficient for R4 to justify our choice.

Visual examples (R2,R4) and other analysis (std dev, R2; hyperparameters search, R3).

We apologize to R2 and R3 for having dedicated less space to their very valuable comments, as we intended to positively address the concerns of R4, who has been more critical of our work. While we agree that addressing these points could strengthen the paper, due to length constraints on the conference version we had to prioritize other, more important ablation studies, such as the ones motivating our design choices empirically. We can, however, reduce the supplemental material and add, for example, qualitative results across different approaches, as well as mean and standard deviation values (we have run the experiments over 3 seeds). For R3, we agree that other hyperparameter combinations may eventually result in performance gains, but for the sake of fairness we kept the same value for the hyperparameter controlling the constraint as in [18]. Needless to say, we appreciate this feedback, which can be incorporated in a potential journal extension.

R4: Class-wise results.

We did not include class-wise results in order to maintain the symmetry, and readability, of the main table. We will include per-class results in the supplementary material.

R4. 2D vs 3D.

For a fair comparison, our design choices are based on the very recent MedIA work in [18]. Furthermore, as already stated, this work is a proof of concept and a validation of a novel calibration approach, and we see no reason why the observed findings would not translate to the 3D scenario.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper originally received a Strong Accept from R2, an Accept from R3, and a Reject from R4. R2 and R3 appear happy with the authors’ response and kept their recommendations, so the only dissent comes from R4, who initially had concerns regarding the adopted baseline for comparison. R4 argued that this baseline was too weak and did not follow the modern practice of using nnU-Net. After reading the rebuttal letter, R4 updated their rating to Weak Reject, justifying this by insisting that nnU-Net should have been used.

    On the other hand, from the authors’ response it does seem that they carried out experiments with nnUNet, at least partially given the time constraints, and the results seemed consistent. Regarding this issue, R2 explicitly mentioned that not using nnUNet was too harsh a reason for rejection, and I tend to agree with them. Therefore, I will vote for acceptance of this paper and pass on to the authors R4’s recommendation of using nnUNet or another stronger backbone in a potential follow-up work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a new regularization technique for calibrating segmentation models. The paper already had good scores after the first round; what was missing was some justification of the authors’ choices. In the rebuttal, the authors were asked to provide arguments on how “possibly sub-optimal architecture and training decisions” (quoting the main MR) may impact their contribution. The rebuttal did shed light on these subjects; hence I recommend acceptance of this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified most of the major concerns of the reviewers, specifically the cropping of the data, the doubts about UNet vs nnUNet as the backbone, real-world settings, and visual examples, including the promise to add class-wise results and more visualisations in the supplementary material. Although the method should be validated on 3D networks, I find the proposed method interesting, and the results demonstrate its effectiveness; I thus suggest acceptance.


