
Authors

Jadie Adams, Shireen Y. Elhabian

Abstract

Statistical shape modeling (SSM) enables population-based quantitative analysis of anatomical shapes, informing clinical diagnosis. Deep learning approaches predict correspondence-based SSM directly from unsegmented 3D images but require calibrated uncertainty quantification, motivating Bayesian formulations. Variational information bottleneck DeepSSM (VIB-DeepSSM) is an effective, principled framework for predicting probabilistic shapes of anatomy from images with aleatoric uncertainty quantification. However, VIB is only half-Bayesian and lacks epistemic uncertainty inference. We derive a fully Bayesian VIB formulation and demonstrate the efficacy of two scalable implementation approaches: concrete dropout and batch ensemble. Additionally, we introduce a novel combination of the two that further enhances uncertainty calibration via multimodal marginalization. Experiments on synthetic shapes and left atrium data demonstrate that the fully Bayesian VIB network predicts SSM from images with improved uncertainty reasoning without sacrificing accuracy.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_34

SharedIt: https://rdcu.be/dnwBu

Link to the code repository

https://github.com/jadie1/BVIB-DeepSSM

Link to the dataset(s)

https://drive.google.com/file/d/1oegp1RZVFJfx8s7aL5GgX55UKGhRgAZn/view


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce an extension of VIB-based statistical shape models by making the network probabilistic, i.e., by estimating a posterior distribution over the weights of the variational auto-encoder. This allows estimation of the epistemic uncertainty, i.e., the uncertainty related to the model itself, after inference.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces Bayesian model fitting for shape analysis, which has never been done in medical imaging.
    • The authors introduce a novel way to efficiently capture the posterior over the network weights by combining concrete dropout and batch ensemble, which has not been proposed before.
    • Two thorough examples are presented in which both types of uncertainty are displayed. This could lead to interesting future work, since the combination of both uncertainties should be related to the estimated error.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper is very dense, with many concepts presented. It may not be readable for many scientists, but the authors have done their best to make it accessible.
    • The description of the new integration of dropout and ensembling for the weight posterior is very limited, and it is hard to appreciate the benefit of combining the two.
    • The authors do not sufficiently discuss the link between image and model error and the two types of uncertainty.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    ok

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors should describe how the two weight posterior distributions differ in nature and how their combination provides a more “expressive” posterior distribution.
    • The memory and computation cost of the method should also be discussed, as it is a limitation of the approach.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a Bayesian neural network for shape analysis. This is a sophisticated approach with limited evaluation, but it is original and ambitious in estimating both model-based and image-based uncertainties.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a formulation of direct image-to-point-correspondence prediction that accounts for aleatoric and epistemic uncertainty. It uses (batch) ensembling or concrete dropout or a combination of both to capture uncertainty on the network parameters (epistemic uncertainty). It builds upon previous work (VIB-DeepSSM) that did not use ensembling. The method is validated against the baseline VIB-DeepSSM in terms of accuracy and uncertainty-error correlation, on two datasets: a synthetic supershapes dataset and a left atrium dataset. Ensemble methods generally display similar accuracy and better uncertainty-error correlation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is interesting. The related work, method and results are explained clearly and they are sound. The paper is an incremental improvement over prior work regarding uncertainty estimation in DeepSSM models. I do not have major criticism regarding the paper, so I am in favor of accepting it.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    N/A; perhaps only that the improvement over previous work is quite incremental.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    From the description of the method in the paper and by looking at previous work, much of the proposed approach would be reproducible (but it builds on multiple iterations of previous work). Some details about the training are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Minor comments:

    • It would have been interesting to report the expected or maximum calibration error (from comparing, over all landmarks and all samples, the error in localizing the landmark to the predicted standard deviation); or alternatively the expected predictive log-likelihood. Both convey more information about the calibration of uncertainties, compared to a correlation coefficient.

    • The reader is referred to the appendices for derivation details, but those details are not provided there:
    • Eq. 4 does not show how to reach Eq. 3.
    • Eq. 5 is stated without proof or reference (although I believe it is correct).

    • “The prediction uncertainty is a sum of the epistemic (variance resulting from marginalizing over Θ) and aleatoric (variance resulting from marginalizing over z) uncertainty”: there is a small confusion here, as the variance per Eq. 5 is the sum of (1) the variance coming from marginalizing over both Θ and z, and (2) the average (aleatoric) variance of the Gaussian p(y | z, Θ). So if you want to state that the aleatoric and epistemic components of the variance are additive, you have to include the marginalization over z as part of the epistemic variance (a sketch of this decomposition is given after this list).
    • “To the best of our knowledge, this combination has not previously been proposed with the motivation of multimodal marginalization for improved uncertainty calibration”: I believe the combination of dropout and batch ensembling was proposed in the original BatchEnsemble paper.
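    For concreteness, a sketch of the decomposition I have in mind, written via the law of total variance (my own illustration in the paper's Θ and z notation as I read it, not a formula from the paper):

        \operatorname{Var}[y \mid x]
          = \underbrace{\mathbb{E}_{\Theta, z}\!\big[\operatorname{Var}[y \mid z, \Theta]\big]}_{\text{average (aleatoric) Gaussian variance}}
          + \underbrace{\operatorname{Var}_{\Theta, z}\!\big[\mathbb{E}[y \mid z, \Theta]\big]}_{\text{variance from marginalizing over both } \Theta \text{ and } z}

    The second term mixes the Θ- and z-marginalizations, so the additive aleatoric/epistemic reading only holds if the z part is counted as epistemic.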
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I have no major criticism regarding the paper.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • This paper addresses calibrated estimation of both aleatoric and epistemic uncertainty for deep learning based statistical shape models in medical imaging.
    • The core contribution is a novel fully Bayesian formulation of the existing VIB-DeepSSM using variational inference with both concrete dropout and ensembling.
    • The novel combination of CD and ensembling to create a multimodal approximation to the true posterior is motivated by obtaining better uncertainty estimates.
    • The benefit of the presented approach is evaluated on a synthetic and a real dataset of images and corresponding shapes.
    • The presented approach is a valuable contribution, and the results demonstrate that it outperforms the former non- (or half-) Bayesian approach with respect to both accuracy and calibration.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an interesting and important problem in deep statistical shape models. Especially in deep regression, uncertainty estimation is often overlooked. Deep SSM in general has the potential to alleviate the drawbacks of traditional SSM, and a fully Bayesian approach to it is highly appreciated.
    • The presented approach is novel, elegant, well described, and well motivated. In my view, the formulations are correct and easy to follow.
    • I personally like the idea of multimodal marginalization by combining CD and ensembling. Ensembling could also counteract the small accuracy loss that is sometimes connected with CD and other MC dropout techniques.
    • Using outlier sets to comprehensively examine the behaviour of estimated uncertainty is a good idea and shows where good uncertainty estimates matter most.
    • The paper itself is well written and all figures are of high quality and offer great value to the reader.
    • The cited references provide a good overview of works that are relevant.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The connection of the derivation of the method to PAC-Bayes bounds is incomplete, or at least confusing to me. As the authors describe, using VI leads to the same ELBO. That PAC-Bayes bounds lead to the same formulation is interesting, but it does not contribute to the main message of the paper. Moreover, the authors say in their contribution statement that they will derive the fully Bayesian VIB from both perspectives; however, in the methods section, only VI is used in the derivation, and the equivalence to PAC-Bayes is merely stated by reference to [4].
    • While the paper is generally a good contribution, I think that the novelty w.r.t. VIB-DeepSSM is limited. The extension from VIB to a fully Bayesian approach is a logical step, but only a minor one.
    • While the use of Pearson’s correlation to assess uncertainty calibration is a good first step, I would appreciate the use of dedicated calibration metrics for regression uncertainty.
    • The paper does not use hypothesis testing/significance testing to analyze the results.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The methods, dataset and training procedure are described sufficiently well in order to reproduce the experiments.
    • The supershape data is generated and reproducible.
    • The authors state that the code will be released after acceptance of the paper.
    • However, it seems that the left atrium dataset is private; therefore, results on this dataset cannot be reproduced.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • As stated above, the connection to PAC-Bayes bounds is confusing, given that VI leads to the same formulation. I would remove at least the contribution statement related to that and only mention the link to PAC-Bayes in the discussion of the method.
    • I suggest using proper uncertainty calibration metrics for regression to assess the goodness of the uncertainty estimates.
    • I suggest testing for the statistical significance of the improvement over VIB-DeepSSM w.r.t. calibration. One could use bootstrapping to estimate confidence intervals and levels of significance for the correlation analysis (a minimal bootstrapping sketch is given after this list). Given the relatively large improvement over VIB, I would expect that the results are in fact significant.
    • The same applies for RMSE, although I would expect that the RMSE is not significantly larger compared to VIB. I.e., the introduction of a fully Bayesian approach does not lead to worse accuracy.
    • One could also test for significance of the multimodal approximation (ensemble+CD vs. ensemble only and ensemble+CD vs. CD only). However, one has to correct for multiple comparisons in this case. This would further strengthen the claim of the paper that multimodal approximation of the posterior is beneficial.
    • What does burn-in for concrete dropout mean in § 3.3? Please elaborate.
    • The authors describe how they used PCA to select shape and image outliers. I wonder what the outliers actually look like w.r.t. the mean shape/image and what qualifies an outlier from an anatomical or image-quality point of view.
    • Fig. 3 breaks the text mid-sentence. I would place the figure at the bottom of the page.
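    To make the bootstrapping suggestion concrete, a minimal sketch (illustrative only, assuming per-subject error and uncertainty arrays are available; Python with NumPy/SciPy):

        import numpy as np
        from scipy.stats import pearsonr

        def bootstrap_correlation_ci(errors, uncertainties, n_boot=10000, alpha=0.05, seed=0):
            """Bootstrap a confidence interval for the error-uncertainty Pearson correlation."""
            rng = np.random.default_rng(seed)
            n = len(errors)
            stats = np.empty(n_boot)
            for b in range(n_boot):
                idx = rng.integers(0, n, size=n)  # resample subjects with replacement
                stats[b] = pearsonr(errors[idx], uncertainties[idx])[0]
            lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
            return lo, hi

    Significance of the difference between two models could then be assessed from the bootstrap distribution of the paired difference in correlation, with correction for multiple comparisons where applicable.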
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is already in very good shape. Even though the novelty is limited w.r.t. VIB-DeepSSM, I think that it will be a good contribution, and I would like to see the paper discussed at MICCAI. An extension of the experimental analysis, including significance testing, would further strengthen the paper's conclusions.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a Bayesian deep learning based statistical shape model by extending the VIB-DeepSSM framework. All reviewers recommend acceptance, agreeing that the paper is well written and tackles an interesting problem, and I therefore recommend early acceptance.

    For the final revision, the authors should take the reviewer comments into account, especially those regarding clarity. Moreover, the authors state that they are the first to tackle high-resolution mesh reconstruction while providing a principled quantification of uncertainty, but I believe earlier approaches to this problem include Tothova et al., Probabilistic 3D Surface Reconstruction from Sparse MRI Information, MICCAI 2020, and this should probably be discussed in the related work.




Author Feedback

We’d like to thank the reviewers for their valuable comments. We have addressed each of the individual reviewers’ concerns below and will incorporate all feedback into the revision.

Meta-reviewer: Our work focuses on the task of predicting Statistical Shape Models (SSMs) from images - an inherently different and more challenging task than mesh reconstruction. Although mesh reconstruction can be achieved using the predicted correspondence points, it is not our primary objective. Instead, our aim is to represent shape in a manner that captures statistical information at the population level. Nonetheless, the Tothova et al. paper is related, and while we have referenced it, we will make the connection to it and other similar papers more clear in the revision.

Reviewer #1: We apologize that the integration of dropout and ensembling for the weight posterior was not described more clearly. In dropout, the approximate posterior distribution over the weights is parameterized by a Bernoulli distribution. In concrete dropout, the discrete Bernoulli distribution is replaced by its continuous relaxation (the concrete distribution) to allow for automatic tuning of the dropout probability. When this is combined with ensembling, multiple concrete distributions are learned and marginalized, resulting in multimodal marginalization. This mixture of concrete distributions provides a multimodal approximate posterior over the weights, which has increased flexibility and expressiveness over a single concrete distribution. We will clarify this in the revision and make the limitation of increased memory cost more clear.
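A minimal sketch of the combination described above is given below. This is illustrative only and not the released BVIB-DeepSSM code; the layer name, member indexing, and initialization are hypothetical, and the concrete-dropout regularization terms are omitted for brevity.

    import torch
    import torch.nn as nn

    class BatchEnsembleConcreteLinear(nn.Module):
        """Linear layer with a shared ('slow') weight matrix, per-member rank-1
        BatchEnsemble factors, and a learnable concrete-dropout rate per member."""
        def __init__(self, in_features, out_features, n_members=4, init_p=0.1, temperature=0.1):
            super().__init__()
            self.shared = nn.Linear(in_features, out_features, bias=False)  # shared slow weights
            self.r = nn.Parameter(torch.ones(n_members, out_features))      # rank-1 fast weights (output side)
            self.s = nn.Parameter(torch.ones(n_members, in_features))       # rank-1 fast weights (input side)
            self.bias = nn.Parameter(torch.zeros(n_members, out_features))
            p0 = torch.full((n_members, 1), init_p)
            self.p_logit = nn.Parameter(torch.log(p0) - torch.log1p(-p0))   # learnable dropout rate (logit space)
            self.temperature = temperature

        def forward(self, x, member):
            # Concrete (relaxed Bernoulli) drop mask with inverted-dropout scaling.
            p = torch.sigmoid(self.p_logit[member])
            u = torch.rand_like(x)
            drop = torch.sigmoid(
                (torch.log(p + 1e-8) - torch.log1p(-p + 1e-8)
                 + torch.log(u + 1e-8) - torch.log1p(-u + 1e-8)) / self.temperature)
            x = x * (1.0 - drop) / (1.0 - p)
            # BatchEnsemble: elementwise input/output modulation of the shared weights.
            return self.shared(x * self.s[member]) * self.r[member] + self.bias[member]

At test time, sampling several dropout masks for each ensemble member and pooling the predictions across members approximates the multimodal (mixture) posterior predictive described above.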

Reviewer #2: As you noted, the derivation of the method from the PAC-Bayes perspective could be more complete. This derivation can be obtained from Alemi et al. and is not necessary for understanding the approach; thus, including it in this work would not be the best use of space. We acknowledge that the PAC-Bayes derivation should not be highlighted as a main contribution and will revise the paper to reflect this change. We appreciate the feedback with regard to calibration metrics and significance testing and will incorporate it into future work. Regarding the reference to loss burn-in in § 3.3, this is a technique utilized in VIB-DeepSSM to improve accuracy and convergence speed. The loss is converted from deterministic (L2) to probabilistic (Eq. 3) over epochs, allowing the network to first learn to predict PDMs alone and then with uncertainty. We will add this clarification. Figure 3A provides some examples of what the outliers look like: outlier shapes may have unusual size or large left atrial appendages or ventricles, and outlier images may be noisy or over-exposed.
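As an illustration of the burn-in described above, a minimal sketch of such a schedule (not our exact implementation; the blending is linear here for simplicity, and the function name is hypothetical):

    import torch

    def burn_in_loss(pred_mean, pred_logvar, target, epoch, burn_in_epochs=50):
        """Blend a deterministic L2 loss into a Gaussian negative log-likelihood
        over the first `burn_in_epochs` epochs."""
        alpha = min(epoch / burn_in_epochs, 1.0)  # 0 at the start, 1 after burn-in
        l2 = ((pred_mean - target) ** 2).mean()
        nll = 0.5 * (pred_logvar + (pred_mean - target) ** 2 / torch.exp(pred_logvar)).mean()
        return (1.0 - alpha) * l2 + alpha * nll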

Reviewer #3: We agree our evaluation could be strengthened by metrics such as expected calibration error and predictive log-likelihood and will incorporate these in future work. We will add further explanation regarding how Eq. 3 is reached and add references for Eq. 5 in the revision. You are correct that the BatchEnsemble paper does suggest combining it with MC dropout, although it lacks theoretical motivation for this combination from a Bayesian perspective. Thank you for pointing this out; we will make this correction in the revision.
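For reference, one form of calibration metric that could be adopted in future work is an interval-based calibration error computed from the predicted per-point Gaussians; a minimal sketch (illustrative only, not a metric used in the paper; array names are hypothetical):

    import numpy as np
    from scipy.stats import norm

    def regression_calibration_error(abs_errors, stds, levels=np.linspace(0.1, 0.9, 9)):
        """Average absolute gap between nominal and empirical coverage of
        central Gaussian prediction intervals."""
        gaps = []
        for q in levels:
            z = norm.ppf(0.5 + q / 2.0)                 # half-width multiplier for level q
            coverage = np.mean(abs_errors <= z * stds)  # empirical coverage at level q
            gaps.append(abs(coverage - q))
        return float(np.mean(gaps))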


