
Authors

Joshua Durso-Finley, Jean-Pierre Falet, Raghav Mehta, Douglas L. Arnold, Nick Pawlowski, Tal Arbel

Abstract

Image-based precision medicine aims to personalize treatment decisions based on an individual’s unique imaging features so as to improve their clinical outcome. Machine learning frameworks that integrate uncertainty estimation as part of their treatment recommendations would be safer and more reliable. However, little work has been done in adapting uncertainty estimation techniques and validation metrics for precision medicine. In this paper, we use Bayesian deep learning for estimating the posterior distribution over factual and counterfactual outcomes on several treatments. This allows for estimating the uncertainty for each treatment option and for the individual treatment effects (ITE) between any two treatments. We train and evaluate this model to predict future new and enlarging T2 lesion counts on a large, multi-center dataset of MR brain images of patients with multiple sclerosis, exposed to several treatments during randomized controlled trials. We evaluate the correlation of the uncertainty estimate with the factual error, and, given the lack of ground truth counterfactual outcomes, demonstrate how uncertainty for the ITE prediction relates to bounds on the ITE error. Lastly, we demonstrate how knowledge of uncertainty could modify clinical decision-making to improve individual patient and clinical trial outcomes.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_46

SharedIt: https://rdcu.be/dnwHr

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a method that leverages uncertainty-aware causal models to study treatment effects, and conduct a proof of concept on the appearance of new MS lesions under treatment with different drugs. They were able to establish lower and upper bounds for the individual treatment effect error.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clearly motivated and grounded on prior research
    • Good guidance of the reader through the proposed method and experiments
    • Clear evaluation and message
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weaknesses to report.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experimental setup is clearly described in the main text; the supplementary material provides further information on the network architecture and learning parameters.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors present a method to add uncertainty to causal models applied to assessing individual treatment effects. This is in my opinion highly significant, since it also provides a way to boost the statistical power of clinical studies, where treatment options need to be carefully selected and some research may be prohibitive from an ethics perspective. The paper is clearly written, well-organized, and guides the reader almost in a “tutorial”-style through the method, experiments and results. The figures are well-prepared and contribute to the understanding of the method and the results.

    Minor remarks:

    • The focus is clearly on the method you propose. However, please consider moving the dataset description to, e.g., an earlier materials section (and therefore consider renaming the current “Methods” section to “Materials and Methods”).
    • The figures contain very small text; consider making either the text or the figures larger if the space constraints allow it.
    • In section 2.1, \tau_t(x) just below equation 1 looks like it is squared, but the superscript is actually a footnote marker. Consider making this clearer.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a method that fills a relevant gap, prior work is appropriately discussed, the paper is well-written and the claims are backed up by a proof-of-concept for multiple sclerosis.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I thank the authors for their clarifications. I still believe the paper merits a strong accept due to the high relevance and thorough proof of concept.



Review #2

  • Please describe the contribution of the paper

    In this paper the authors proposed a Bayesian deep learning approach for estimating the posterior distribution and uncertainty for treatment response. The proposed approach was evaluated on a large, multi-center dataset of MR brain images. Multiple experiments were performed to show the benefit of predicting a distribution rather than a simple mean prediction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper addresses a very important question: estimating uncertainty in imaging-based precision medicine. The proposed approach is novel and could be useful for the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A lot of experiments were performed but they don’t necessarily demonstrate the advantage of the proposed approach as it wasn’t compared with existing uncertainty estimation algorithms.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is below average. The experiments were carried out using data from randomized clinical trials that are not publicly available. Code also won’t be shared.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper shows promise and is well-organized and well-written. However, the reviewer has some concerns regarding the evaluation section. While the authors conducted extensive experiments to demonstrate the benefits of predicting distributions using their proposed approach over traditional predictions that rely on simple mean values, most experiments do not convincingly demonstrate the superiority of the proposed approach.

    One of the main issues is that the evaluation strategy of the proposed approach is not fundamentally different from existing uncertainty estimation approaches, yet it was not compared with baseline and state-of-the-art approaches. Although the authors mentioned that “The usual strategy for validating uncertainty estimates, discarding uncertain predictions and examining performance on the remaining predictions, is not always appropriate…,” most experiments in this paper are also based on discarding uncertain predictions. Additionally, since the ground truth of treatment response distributions is not available, the correctness of the distributions cannot be validated. While the authors made some nice attempts to validate their predictions, a comparison with results from existing uncertainty estimation methods and simple mean value predictions is necessary to demonstrate the superiority of the proposed approach.

    Some minor issues: 1) the font in the figures could be larger for better visibility 2) some training details are missing, e.g. how was the ResNet trained? Was a pretrained model used or was a model trained from scratch using all datasets with different drugs?

    In summary, the paper has some promising aspects, but the evaluation section requires further development. The reviewer encourages the authors to include a comparative evaluation with existing methods, which could significantly enhance the paper’s contribution to the field.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has some flaws in its evaluation but overall has novelty and merit.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The rebuttal addressed most of my concerns.



Review #3

  • Please describe the contribution of the paper

    This paper provides a method for imaging-based prediction of multiple treatment outcomes along with uncertainty estimates in those outcome predictions. Bayesian deep learning models are trained using data from multiple clinical trials, each of which assigned participants to a single treatment, had the person provide a baseline brain MRI, and had trial outcomes measured. After training, a new brain MRI can be given (alongside some clinical covariates) and the placebo-adjusted treatment response to any of the trained-upon treatments can be estimated. The method is applied to multiple sclerosis clinical trials, where white matter lesion accumulation is the outcome of interest.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength is that estimation of uncertainty in treatment effects is a frontier problem in medical image analysis currently. Generally speaking machine learners in medicine are far too sure of themselves, providing yes/no answers without providing any hint at how much training data is really backing up those answers. As the authors point out, there are thoughtful ways to incorporate such uncertainty into clinical scenarios (beyond simply deleting uncertain predictions).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There’s critical information missing about how the method actually works. It builds on an existing ResNet architecture that takes various features as input and produces a most likely outcome associated with those feature values. This new approach takes such features as input but now produces a mean and standard deviation in outcome… but it’s not clear how it accomplishes that. The inputs to network training still appear to be feature/outcome pairs, not feature/mean outcome/SD of outcome triples, and presumably no two individuals assigned to the same treatment have precisely the same feature values. So there must be some kind of smoothing or regularization going on to determine that there’s outcome uncertainty associated with one specific feature value. We aren’t told how that actually happens. Also, it appears that there are separate modules in the neural network, one of which operates on imaging data, the other of which operates on ancillary clinical data… but it isn’t clear what either of those modules is doing; how exactly they are trained; what the training parameters are; what software it is all implemented in; etc.

    Another issue is that unless I missed it, a neural network architecture is trained, and then its performance is evaluated on the very same data set it was trained on, rather than an independent test set. As a result, there are serious concerns about over-statement of algorithm performance.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The data seems to be well described, so that anyone getting access to the described clinical trial data is in a good position to reproduce. But as stated above there’s a huge amount of information missing about what the neural network training approach actually consists of; how it works; what it’s doing; exactly how to implement it and so on. There’s no chance of reproducing these experiments based solely on what is written here.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The thoughtful consideration of use cases for the uncertainty, beyond simply removing all uncertain predictions, is highly welcome. The experiments provide what is probably the best one can do in terms of evaluating uncertainty estimates in this setting, given that one cannot actually evaluate outcomes for a treatment someone hasn’t actually taken.

    As stated above though, the core of the paper seems to be missing: the changes to an existing architecture and procedure that made it possible to estimate means and SDs in outcomes associated with a specific setting of the feature values. Everything hinges on that core; once that part is figured out, the rest of the paper is relatively easy / simple.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I am now satisfied that I understand how the method works, although the description of how it arrives at individual prediction means and SDs is rather cryptic. Basically it is trying to predict outcome means and SDs such that the training set is predicted with high probability, i.e. with means close to those ground truth means and corresponding SDs small. Now that I have a sense of how this works I am improving my score although I urge the authors to make the description less cryptic.
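
The training idea summarized above can be sketched as follows. This is a minimal illustration, assuming a Gaussian outcome distribution fitted by maximum likelihood as described in the rebuttal; the function names and the log-variance parametrization are assumptions made here for illustration, not the authors' implementation. Only feature/outcome pairs are needed, because the loss scores each observed outcome against the predicted mean and variance.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of one observed outcome y under N(mu, exp(log_var)).

    Hypothetical helper: mu and log_var stand for the two values a network
    head would predict for a single patient and treatment.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mu) ** 2 / var)

def mean_nll(targets, means, log_vars):
    """Average loss over a batch of (observed outcome, predicted mean, predicted log-variance)."""
    losses = [gaussian_nll(y, m, lv) for y, m, lv in zip(targets, means, log_vars)]
    return sum(losses) / len(losses)

# A confident prediction (small variance) is rewarded when the mean is right,
# and a large predicted variance softens the penalty when the mean is wrong.
good_fit = gaussian_nll(0.0, 0.0, 0.0)
bad_fit_confident = gaussian_nll(3.0, 0.0, 0.0)
bad_fit_uncertain = gaussian_nll(3.0, 0.0, math.log(9.0))
```

Minimizing this quantity pushes predicted means toward the observed outcomes while allowing a larger standard deviation where the outcome is hard to predict, which is the behavior the reviewer describes.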




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed a Bayesian deep learning approach for estimating the posterior distribution and uncertainty for treatment response. Key strengths:

    1. Clear motivation and important clinical problem
    2. Novelty in formulating uncertainty
    3. Comprehensive experimental design

    Key weaknesses:

    1. Lack of comparison with existing uncertainty estimation algorithms.
    2. Lack of clarity in the method description; some details are missing.

    In the rebuttal, please especially clarify the method architecture and evaluation strategy.




Author Feedback

We thank the reviewers for their time and for providing valuable insights into our paper. Our work represents a novel application of uncertainty propagation to enhance precision medicine tasks and treatment effect estimation. We are delighted that the reviewers recognized the innovative aspect of our approach.

Method Description: Reviewer 3 noted missing implementation details. Some model architecture and hyperparameter details are found in section 2.2. However, additional details were available in Appendix Fig 3, which may have been missed due to a missing reference. Because our multi-headed ResNet is based on previously published architectures, we also referred the reader to [Durso-Finley 2022, Shalit 2017] for more details on the training procedure. However, we can add the package (PyTorch) and hardware details in the camera-ready version and add a reference to the appendix in section 2.2.

Reviewer 3 asked how we trained our probabilistic estimates. In section 2.2, we describe how we adapted a machine learning uncertainty quantification (UQ) technique [Kendall and Gal, 2017] for counterfactual uncertainty estimation. In that section we describe how we parametrized the outcome distribution as a Gaussian and optimized its parameters using maximum likelihood estimation. This method does not require ground truth targets for counterfactual outcomes and therefore does not require the smoothing which the reviewer suggested may be needed.

Reviewer 3 questioned whether we evaluated the model on training data. We wish to clarify that the dataset used for evaluating the model is always distinct from the dataset used for training. Specifically, we use nested cross-validation [Vabalas, 2019] and will add the word “nested” in the camera-ready version.

Evaluation Strategy: Reviewer 2 suggests that a comparison of uncertainty quantification (UQ) methods would be beneficial. Typically, validating UQ methods involves filtering the uncertain samples and showing an improvement in the performance based on ground truth labels. However, we emphasize that the focus of the paper is to introduce a strategy for applying uncertainty in counterfactual individual treatment effect (ITE) estimation where, as the reviewer states, there is no ground truth available. As such, this paper proposes a novel strategy to evaluate the uncertainty of the ITE precisely in the absence of ground truth. Specifically, the novelty in our experimental procedure is that we validate our ITE predictions with the bounds on the ITE error (Section 2.3). In figures 3b and 3c, we show decreasing bounds on the ITE error with more confident predictions. We believe this is a possible solution to validating causal models in real-world tasks. We can clarify this novelty in section 2.3.

The novelty in our strategy to improve precision medicine was to propagate the uncertainty to downstream treatment effect estimation tasks. In some cases (Figure 6), enriching a trial with patients with low predictive uncertainty is indeed the optimal strategy, but this is not always the case. For example, in figures 4a and 4b, the magnitude of the uncertainty of a patient is not the only consideration when deciding the optimal treatment, and in figure 5 the effect of uncertainty-driven treatment decisions is improved where certain clinical outcomes are more “costly” than others.

As the focus is on the novel application of uncertainty and not the method of uncertainty estimation, we chose a popular uncertainty estimation technique in machine learning [Kendall and Gal, 2017] to evaluate our strategy. Our strategy is agnostic to the uncertainty estimation method used, so that the user is free to use their own. We will emphasize this versatility in the camera-ready version.

The reviewer also suggested comparing our method to a baseline using only the mean prediction. This baseline is in fact already present in figures 4a, 4b, and 5a (referred to as “mean policy”).
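
The propagation of outcome uncertainty to the ITE, and the selection of confident predictions discussed in the feedback above, can be sketched as follows. This is a hedged illustration only: it assumes that each treatment head yields a Gaussian mean and standard deviation and that the two predicted outcomes are independent, so that their variances add. That independence assumption, the function names, and the numbers are all introduced here for illustration and are not taken from the paper.

```python
import math

def ite_with_uncertainty(mu_a, sd_a, mu_b, sd_b):
    """ITE between treatments a and b as a difference of two Gaussian outcomes.

    Assumption (not from the source): the two outcome predictions are
    independent, so the variance of the difference is the sum of variances.
    """
    ite_mean = mu_a - mu_b
    ite_sd = math.sqrt(sd_a ** 2 + sd_b ** 2)
    return ite_mean, ite_sd

def keep_most_confident(predictions, fraction):
    """Keep the given fraction of (ite_mean, ite_sd) pairs with the smallest ite_sd."""
    ranked = sorted(predictions, key=lambda p: p[1])
    n_keep = max(1, round(fraction * len(ranked)))
    return ranked[:n_keep]

# Made-up predicted outcomes for one patient under two treatments.
example_ite = ite_with_uncertainty(2.0, 3.0, 1.0, 4.0)

# Made-up ITE predictions for four patients; keep the more confident half.
cohort = [(0.5, 2.0), (1.0, 0.5), (-0.2, 1.0), (0.3, 3.0)]
confident_half = keep_most_confident(cohort, 0.5)
```

As the feedback notes, ranking by uncertainty alone is only one possible use; a treatment decision may also weigh the size of the predicted effect and the cost of each outcome.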




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposed a Bayesian deep learning approach for estimating the posterior distribution and uncertainty for treatment response. Key strengths:

    1. Clear motivation and important clinical problem
    2. Novelty in formulating uncertainty
    3. Comprehensive experimental design

    The rebuttal has adequately addressed the method detail issue and clarified the novelty. It also clarifies the baseline comparison issue.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Key strengths:

    • Good motivation and clinically important problem to understand uncertainty in outcome of treatment assignment
    • Novelty of approach

    Key weaknesses:

    • Lacking comparisons to other uncertainty estimation approaches
    • Lack of clarity regarding methodology details

    The rebuttal helped clarify the methods and evaluation approach, as evidenced by the increase in reviewer scores. This seems like a really interesting paper in a still nascent area of uncertainty estimation with a clinically impactful application. While I do have concerns regarding the evaluation (perhaps what was done was the best that could be done for the given dataset and was an interesting validation method, but really it would have been helpful to validate on a dataset with individual ground truth treatment outcomes), I think this paper would be of interest to the community.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper addresses an important problem that has not received much attention before. While it is indeed the case that the novelty of the methods is limited, their combination does appear to achieve interesting preliminary results which are close to the acceptance threshold. I am thus inclined to recommend acceptance with a low grade.


