
Authors

Thierry Judge, Olivier Bernard, Woo-Jin Cho Kim, Alberto Gomez, Agisilaos Chartsias, Pierre-Marc Jodoin

Abstract

Aleatoric uncertainty estimation is a critical step in medical image segmentation. Most techniques for estimating aleatoric uncertainty for segmentation purposes assume a Gaussian distribution over the neural network’s logit value modeling the uncertainty in the predicted class. However, in many cases, such as image segmentation, there is no uncertainty about the presence of a specific structure, but rather about the precise outline of that structure. For this reason, we explicitly model the location uncertainty by redefining the conventional per-pixel segmentation task as a contour regression problem. This allows for modeling the uncertainty of contour points using a more appropriate multivariate distribution. Additionally, as contour uncertainty may be asymmetric, we use a multivariate skewed Gaussian distribution. In addition to being directly interpretable, our uncertainty estimation method outperforms previous methods on three datasets using two different image modalities.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_21

SharedIt: https://rdcu.be/dnwAS

Link to the code repository

https://github.com/ThierryJudge/contouring-uncertainty

Link to the dataset(s)

https://www.creatis.insa-lyon.fr/Challenge/camus/

http://db.jsrt.or.jp/eng.php


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors apply and extend a Differentiable Spatial to Numerical Transform (DSNT) network for landmark localization in order to learn contours for segmentation and to predict uncertainty. They extend the DSNT approach to predict the mean and covariance parameters of univariate Gaussian, bivariate Gaussian, and skewed bivariate Gaussian distributions for 2D landmark coordinates. They show that the learned uncertainty distributions for the landmarks can be used to accurately estimate uncertainty for the segmentation maps built on them, outperforming standard pixel-wise segmentation uncertainty techniques.
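    For readers unfamiliar with this family of objectives, here is a minimal sketch of a bivariate Gaussian negative log-likelihood over predicted landmark coordinates; the function name, tensor shapes, and parameterization are illustrative assumptions, not the paper’s implementation.

```python
import torch

def bivariate_gaussian_nll(mu, cov, target):
    """NLL of 2D landmark targets under per-landmark bivariate Gaussians.
    mu, target: (B, K, 2); cov: (B, K, 2, 2), assumed positive definite."""
    diff = (target - mu).unsqueeze(-1)                      # (B, K, 2, 1)
    maha = diff.transpose(-1, -2) @ torch.linalg.inv(cov) @ diff
    maha = maha.squeeze(-1).squeeze(-1)                     # Mahalanobis term
    return 0.5 * (maha + torch.logdet(cov)).mean()          # constants dropped
```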

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The uncertainty estimation method is very novel to me, with an intuitive and theoretically convincing (relative to deep-learning papers) objective function.
    • There is limited work in anatomical landmark localization uncertainty (although not no work!), and this paper is a fresh contribution to the community working in this area.
    • The empirical results are very good overall.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Authors claim there is no work in uncertainty estimation in landmark localization, but please see these references: 1) Thaler, Franz, et al. “Modeling Annotation Uncertainty with Gaussian Heatmaps in Landmark Localization.” Machine Learning for Biomedical Imaging 1.UNSURE2020 special issue (2021): 1-10. 2) Schöbs, Lawrence, Andrew J. Swift, and Haiping Lu. “Uncertainty Estimation for Heatmap-based Landmark Localization.” IEEE Transactions on Medical Imaging (2022).

    • In Table 1, for the landmark prediction using MC-Dropout, can the authors clarify the objective function for this method? Is it the univariate objective function using the mean prediction of multiple passes, is it multiple passes of the method from their reference [21], or something else?

    • To fully understand their univariate model, I had to read the authors’ reference [21], since it is essentially the same method they use as their baseline. It would be beneficial to readers to communicate more clearly that this work is built directly on [21], with a sentence directing readers to [21] for a full understanding.

    • The results from N1, N2 and SN2 do not show a particularly clear or logical pattern to me. It is concerning that a method that dominates on one dataset performs poorly on another, e.g. SN2 on the PCUS dataset, which the authors mention.

    • There is limited discussion of why N1 is often better than N2 despite the latter being more expressive. The paper could be improved with more discussion comparing the authors’ proposed methods to each other.

    • No standard deviations of the results are given.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As long as the authors provide code, reproducibility is not a problem.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • I think this is excellent work that would benefit from being extended to a journal. It would be beneficial to discuss more work in the anatomical landmark localization uncertainty area and compare against those methods. I believe this method is better than current methods and would be very beneficial to the community. You could apply the method to common benchmark datasets (e.g. the ISBI 2015 Cephalometric dataset, https://ieeexplore.ieee.org/abstract/document/7061486/) and compare with existing works.

    • Further discussion/comparison of the author’s proposed methods would improve the paper. As I mentioned in weaknesses, it is concerning that results of the proposed methods fluctuate between datasets.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • It is a novel and empirically convincing uncertainty estimation technique for landmark localization that will be beneficial to the community. The results’ sensitivity to the dataset is what prevents me from giving a higher score.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The authors addressed my questions.



Review #2

  • Please describe the contribution of the paper

    This paper proposes an asymmetric contour uncertainty estimation method for medical image segmentation, redefining the conventional per-pixel segmentation task as a contour regression problem, which allows for modeling the uncertainty of contour points.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of estimating contour uncertainty is interesting and meaningful for ensuring segmentation quality and reliability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The discussion of aleatoric and epistemic uncertainty is confusing. For example, in [15], a model is proposed that combines aleatoric and epistemic uncertainty in one model. A detailed explanation of, and comparison between, [15] and this work should be provided to outline the main contribution.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper is reproducible, since the authors state that they will release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. There are some confusing or wrong explanations. E.g., why do the authors use the expression ‘Aleatoric uncertainty estimation is a critical step in medical image segmentation’? Why ‘aleatoric uncertainty’ and not ‘epistemic uncertainty’?

    “Most techniques for estimating aleatoric uncertainty for segmentation purposes assume a Gaussian distribution over the neural network’s logit value modeling the uncertainty in the predicted class” — this is not true; it should be epistemic uncertainty.

    2. ‘While various uncertainty methods have been investigated for pixel-wise image segmentation, no uncertainty method for point-defined contours exists to date.’ This statement is not true. There are many works on uncertainty estimation of contours, for example:
    (1) Logistic regression, neural networks and Dempster–Shafer theory: A new perspective; (2) A neural network classifier based on Dempster-Shafer theory; (3) Lymphoma segmentation from 3D PET-CT images using a deep evidential network — which are simpler and effective ways to obtain segmentation contours. A meaningful comparison with these related works is necessary.

    3. What is the theoretical guarantee of defining confidence as the inverse of uncertainty? Please give references.
    4. The difference between the reliability diagrams in Fig. 4 needs to be better explained.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have problems introducing and distinguishing aleatoric and epistemic uncertainty, which makes the paper confusing and the comparison results inconclusive.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Here are some of my comments about this paper: 1. In the abstract, the authors claimed that ‘Aleatoric uncertainty estimation is a critical step in medical image segmentation,’ which is not true. As mentioned in the authors’ feedback, ‘uncertainty can be epistemic (from the model) and aleatoric (from data).’ Epistemic uncertainty can be quantified and reduced to enable better segmentation performance, while aleatoric uncertainty cannot be reduced but only identified and quantified. Thus, researchers usually work on epistemic uncertainty to improve performance, and on aleatoric uncertainty to know where and when the model will fail. I think the paper does not explain its motivations and objectives well.

    2. Also, I am still confused about Fig. 4, which defines confidence (c) as c = 1 - u. The authors did not really answer the question about the theoretical guarantee in defining confidence. Normally, reliability diagrams are drawn by binning the predicted probabilities and then calculating, per bin, the mean predicted probability (noted as confidence) and the fraction of correctly predicted samples (noted as accuracy), as Guo et al. explain in “On calibration of modern neural networks,” PMLR, 2017. In general, the novelty of this paper is OK, but more attention should be paid to avoiding confusing/wrong explanations.
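    For reference, the binning procedure the reviewer describes (Guo et al., 2017) can be sketched as follows; function and variable names are illustrative, and this is a generic sketch rather than anything from the paper under review.

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin statistics for a standard reliability diagram: bin predictions
    by confidence, then compare mean confidence with empirical accuracy.
    confidences: float array in [0, 1]; correct: boolean array of hits."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_conf, bin_acc, bin_frac = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_conf.append(confidences[mask].mean())
            bin_acc.append(correct[mask].mean())
            bin_frac.append(mask.mean())
    # Expected Calibration Error: sample-weighted |accuracy - confidence| gap
    ece = sum(f * abs(a - c) for f, a, c in zip(bin_frac, bin_acc, bin_conf))
    return bin_conf, bin_acc, ece
```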



Review #3

  • Please describe the contribution of the paper

    The authors propose a contour/landmark-based approach to segmentation with uncertainty estimation. A neural network regresses heat maps for each landmark, from which the expectation and covariance of the landmark location can be derived. The network can be trained end-to-end with a regression loss on the landmark’s location that is differentiable with respect to the heatmaps. Different variants are proposed for the data likelihood: axis-aligned or multivariate Gaussians, or a skewed normal distribution. The method is validated on lung and cardiac datasets, compared to an aleatoric uncertainty segmentation method, test-time augmentation and MC dropout.
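    As a minimal sketch of the spatial-to-numerical read-out described above (the names and coordinate conventions here are illustrative assumptions, not the authors’ code), the mean and covariance of a landmark can be taken as the first two moments of the normalized heatmap:

```python
import torch

def heatmap_to_gaussian(heatmap):
    """Treat a non-negative (H, W) heatmap as a 2D probability mass function
    and read off the mean and covariance of the landmark location."""
    z = heatmap / heatmap.sum()                       # normalize to sum to 1
    H, W = z.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=z.dtype),
                            torch.arange(W, dtype=z.dtype), indexing="ij")
    mu = torch.stack([(z * xs).sum(), (z * ys).sum()])     # E[x], E[y]
    dx, dy = xs - mu[0], ys - mu[1]
    cov = torch.stack([
        torch.stack([(z * dx * dx).sum(), (z * dx * dy).sum()]),
        torch.stack([(z * dy * dx).sum(), (z * dy * dy).sum()]),
    ])                                                     # 2x2 covariance
    return mu, cov
```

    Because every step is differentiable in the heatmap values, such a read-out can be trained end-to-end with a regression loss, as the review notes.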

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    On the one hand, this approach to landmark regression and uncertainty estimation (with a standard regression loss but a heatmap approach) is methodologically interesting (which justifies my overall rating), although it is probably restricted to 2D applications or 3D sparse landmark detection.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    On the other hand, the second part of the paper has flaws, and the validation is less convincing (more comments below).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper would be difficult to reproduce without an official implementation due to the lack of clarity about the exact definition of uncertainty maps, as well as validation metrics. Details about the datasets are provided. Some but not all implementation details are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Unlike the “per-point covariance ellipses” which are very clear, it is difficult to understand from the text what the uncertainty maps display exactly, yet this is key for the paper. Even from the supplementary material this is not clear.
    • The paper uses validation metrics based on uncertainty maps to make it possible to compare to pixel-wise segmentation methods, but in the end does not compare to previous methods like probabilistic U-Nets, PhiSeg, or Stochastic Segmentation Networks.
    • The validation metrics are not easily interpretable.
    • MCE involves the per-pixel “predicted confidence”, but it is not clear how this confidence is computed.
    • Uncertainty-error Mutual Information: can be written as H(U) - H(U|E). If MI is not normalized, it may favor high-entropy distributions: for a fixed reduction in entropy by a factor 0 < rho < 1, i.e. H(U|E) = rho*H(U), we get MI = (1 - rho)*H(U), which is larger for larger entropy values H(U) (see the numeric sketch after this list). So there is the possibility that the uncertainty maps from the proposed method simply have higher entropy in general, rather than being more correlated with the error maps.
    • In addition to these metrics, using the Expected Calibration Error (or another metric) directly on the regressed landmark coordinates would also have been instructive, although it is maybe not applicable to pixel-wise segmentation methods (although one could compute contours from the per-pixel segmentation samples, then equally spaced landmarks from the contours).
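    A tiny numeric sketch of the entropy argument above, with hypothetical values:

```python
# For a fixed proportional entropy reduction H(U|E) = rho * H(U), the mutual
# information MI = H(U) - H(U|E) = (1 - rho) * H(U) grows with H(U) alone.
rho = 0.5
for h_u in (0.5, 1.0, 2.0):             # hypothetical entropies of the U map
    print(f"H(U)={h_u:.1f}  H(U|E)={rho * h_u:.1f}  MI={(1 - rho) * h_u:.1f}")
# Higher-entropy uncertainty maps score higher MI at the same rho.
```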

    Other comments:

    • Are the uncertainty maps’ pixel values (Fig. 2-3) normalized in some way or is the range arbitrary? My understanding from the supplementary material is that they are obtained by computing marginal probability density functions along lines orthogonal to the contour, and reporting these (unnormalized) p.d.f. values. It is also not clear why this can be interpreted directly as an uncertainty.
    • \hat{Z}^k is not defined. I assume it is Z^k normalized to sum to 1, but it should be stated.
    • Fig. 4, reliability diagrams “confidence is defined as the inverse of uncertainty”. Confidence is between 0 and 1 -> does it mean that uncertainty is between 1 and +infinity ? How is uncertainty defined?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach is methodologically interesting, so I believe it should be accepted, but there are still flaws in the paper that prevent me from rating it higher.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I have increased the score from 5: weak accept to 6: accept.

    My main expectation from the rebuttal was to have a crisp, clear explanation of what these uncertainty maps correspond to, as this is important for the paper. Thus I am not fully satisfied with the rebuttal, as it is still partly unclear to me (for instance, the authors mentioned that they normalize the marginal distributions to obtain uncertainties, but did not explain how, which is important for the MCE metric and reliability diagrams).

    Also, the rebuttal claims that the competing pixel-wise methods they did not evaluate against can only be trained when several segmentations are available, which is not the case.

    In the end these are not fatal flaws for me, and I still find the paper interesting and novel overall (the other reviewers did not argue otherwise either). I adjusted my grade to reflect this.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work reformulates uncertainty quantification in segmentation within a landmark regression context, so the mean and covariance of each landmark can be computed and used for uncertainty estimation. Reviewer 1 (who admittedly appears to be very enthusiastic about the landmark localization aspect of the method, more than about the segmentation point of view) liked the paper a lot: “The uncertainty estimation method is very novel to me, with an intuitive and theoretically convincing …” and recommended strong acceptance.

    Unfortunately, neither R2 nor R3 is fully convinced at this point, and I tend to agree with them that this paper would benefit from going through a rebuttal phase and some further discussion with reviewers. Both R2 and R3 “complain” about the authors’ unclear interpretation of the relationship between confidence and uncertainty, and about why exactly the variance of these Gaussians can be interpreted as an estimate of aleatoric uncertainty. I would like to encourage the authors to focus on answering these concerns, while also addressing the very detailed comments and questions provided by R3. Space allowing, please also address the concerns by R1 about the per-dataset fluctuations in performance of the proposed technique.




Author Feedback

We thank the reviewers and AC for their constructive comments. The main criticisms concerned the ‘relationship between confidence and uncertainty, and why can we interpret the variance of gaussians as estimates of aleatoric uncertainty.’ We address these as follows:

R2: confusion about epistemic/aleatoric uncertainty. Uncertainty can be epistemic (from the model) or aleatoric (from the data). We focus on aleatoric uncertainty because it can capture high image variability and image-degrading artefacts that affect the reliability of the segmentations. Still, we offer a comparison with an epistemic method (MC dropout) in the evaluation. Other methods [15] estimate global uncertainty, preventing this separate analysis.

R2: Gaussian logits for modelling aleatoric uncertainty. As presented in equation 10 of [16], Gaussian-perturbed logits (outputs) model aleatoric uncertainty. Confusion may arise from the term “logit”, which refers to network outputs rather than model weights. Indeed, Gaussian modeling of network weights is a way of estimating epistemic uncertainty.

R3: unclear uncertainty maps. For the task of segmentation, uncertainty maps express the probability of wrongly classifying pixels, which is highest at the border between two classes. In our formalism, the probability of the presence of a contour (and thus the separation between two classes) can be effectively represented by the normalized marginal probabilities computed orthogonal to the mean contour. These can thus be used to derive uncertainty maps at the pixel level: a narrow marginal (small variance) yields minimal uncertainty away from the mean contour, while a wide marginal (large variance) produces a larger region of uncertainty.

R2: confidence as the inverse of uncertainty. “Inverse” is indeed not the right term. For uncertainty (u) bounded by 0 and 1, confidence (c) is defined as c = 1 - u. This is only used in the context of calibration. This is similar to [A] and equivalent to the uncertainty calibration error [B], as confidence/uncertainty and error rate/accuracy are complements.
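One possible reading of the uncertainty-map construction sketched in the rebuttal, as a minimal sketch (the Gaussian marginal along the contour normal and c = 1 - u come from the text above; the peak-at-1 normalization and all names are assumptions):

```python
import numpy as np

def pixel_uncertainty(signed_dist, sigma):
    """Marginal density along the contour normal, normalized to peak at 1 on
    the mean contour, used as a per-pixel uncertainty u in [0, 1] (assumed
    normalization); confidence is then c = 1 - u, as stated in the rebuttal.
    signed_dist: pixel's distance to the mean contour along the local normal.
    sigma: std of the contour point's marginal in that direction."""
    u = np.exp(-0.5 * (signed_dist / sigma) ** 2)  # 1 on the contour, -> 0 away
    return u, 1.0 - u                              # (uncertainty, confidence)

# A narrow marginal confines uncertainty to a thin band around the contour,
# while a wide marginal spreads it out, matching the rebuttal's description:
for sigma in (1.0, 4.0):
    u, c = pixel_uncertainty(signed_dist=3.0, sigma=sigma)
    print(f"sigma={sigma}: u={u:.3f}, c={c:.3f}")
```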

  • We briefly address other points: R1.2 MC Dropout objective function: trained with an MSE loss using the DSNT [21] output; the uncertainty is obtained from multiple forward passes.

R1.3 Univariate model details: We will modify the text to add clarifications.

R1.4-5 Results on N1, N2 and SN2: all proposed methods (N1, N2, SN2) perform well and outperform the benchmarks on almost every metric. Variations are due to the nature of the data, which may in some cases favour simpler models (see discussion).

R2.1, R1.2 existing work on contour uncertainty: our contribution is on landmark localization for contouring. Prior work considers either 1) annotation uncertainty learned from expert variability or 2) landmark localization without the aspect of contour uncertainty. The three papers suggested by R2 express pixel-wise uncertainty which, similarly to [15], appears global and not strictly aleatoric. However, the aim of our paper is to introduce innovations in the estimation of aleatoric uncertainty along contours.

R3.1 No comparison to previous methods: these methods are designed for data annotated by multiple experts (not available for our datasets), making a direct comparison difficult.

R3.2 Validation metrics: mutual information is used to quantify how much of the error map can be predicted by the uncertainty (U) map [15]. U-maps with high entropy may bias this metric but are likely to perform poorly on the other measures. Calibration was not computed on points due to space constraints and the difficulty of converting pixel-wise uncertainty to points.

R3.3: \hat{Z}^k is the result of normalizing Z^k to sum up to 1.

[A] Tornetta G.N. Entropy Methods for the Confidence Assessment of Probabilistic Classification Models. Statistica, 2021. [B] M.-H. Laves et al. Well-Calibrated Model Uncertainty with Temperature Scaling for Dropout Variational Inference. Workshop on Bayesian Deep Learning. NeurIPS 2019.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    R2 updated their rating from weak reject to weak accept, and R3 raised their score to accept, while R1 maintained their strong accept recommendation. R2 and R3 missed some explanation of the actual meaning of the computed uncertainty maps (epistemic vs aleatoric, or why the maps can be interpreted as confidence), but novelty and interest outweigh these limitations, which do not seem critical, and the consensus seems to be towards acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies segmentation uncertainty from a landmark regression type of context, meaning uncertainty estimates have an increased focus on critical points along the object boundary.

    The reviewers have reached a consensus of acceptance, among other things because they find the reformulation of uncertainty quantification interesting, and I don’t see enough grounds to go against this, but I do see some inconsistencies in the reviews and paper that the PCs should be aware of when reaching the final decision:

    • The most critical reviewer, R2, is in my opinion unreasonable when they strongly disagree that “aleatoric uncertainty estimation is critical”. Also, their statements about estimating aleatoric segmentation uncertainty via Gaussian distributions over logits are incorrect – this is what mean-variance networks such as the stochastic segmentation networks (Monteiro et al, 2020) do. But on the other hand, this reviewer’s criticism of the reliability diagrams in the paper (Fig. 4) is 100% correct – the description given in the paper does not fit what reliability diagrams should look like.
    • There seems to be a lot of related work missing that considers uncertainty quantification as a problem on the segmentation boundary – anything that takes shape models as a starting point does exactly this. Tothova et al, MICCAI 2020 is one example – there are many more.

    Thus, in my opinion, this is still a borderline paper that could benefit from a more thorough rewriting and reviewing process. I don’t think the faults I see are strong enough to go against the unanimous vote of three reviewers, but I encourage the authors to clarify the above-mentioned points before publication.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified some of the points raised by the reviewers. A further clarification from the authors about the distinction between aleatoric and epistemic uncertainty would strengthen the final version.


