
Authors

Skylar E. Stolte, Kyle Volle, Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods, Kevin Brink, Matthew Hale, Ruogu Fang

Abstract

Model calibration measures the agreement between the predicted probability estimates and the true correctness likelihood. Proper model calibration is vital for high-risk applications. Unfortunately, modern deep neural networks are poorly calibrated, compromising trustworthiness and reliability. Medical image segmentation particularly suffers from this due to the natural uncertainty of tissue boundaries. This is exacerbated by their loss functions, which favor overconfidence in the majority classes. We address these challenges with DOMINO, a domain-aware model calibration method that leverages the semantic confusability and hierarchical similarity between class labels. Our experiments demonstrate that our DOMINO-calibrated deep neural networks outperform non-calibrated models and state-of-the-art morphometric methods in head image segmentation. Our results show that our method can consistently achieve better calibration, higher accuracy, and faster inference times than these methods, especially on rarer classes. This performance is attributed to our domain-aware regularization to inform semantic model calibration. These findings show the importance of semantic ties between class labels in building confidence in deep learning models. The framework has the potential to improve the trustworthiness and reliability of generic medical image segmentation models. The code for this article is available at: https://github.com/lab-smile/DOMINO.
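
To make the core idea concrete for readers of this page, here is a minimal PyTorch sketch of a domain-aware penalty added to cross-entropy. This is an illustration under stated assumptions, not the paper's exact formulation: `W` stands for a (C, C) class-to-class penalty matrix (e.g., derived from a validation confusion matrix or from the label hierarchy) with a zero diagonal, and `beta` is a hypothetical weighting term.

```python
import torch
import torch.nn.functional as F

def domino_style_loss(logits, target, W, beta=0.1):
    """Cross-entropy plus a domain-aware confusion penalty (sketch only).

    logits: (N, C, ...) raw network outputs; target: (N, ...) integer labels;
    W: (C, C) penalty matrix, zero on the diagonal and larger for
    semantically distant class pairs; beta: hypothetical penalty weight.
    """
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)        # (N, C, ...)
    penalty_rows = W[target]                # (N, ..., C): row of W for each true class
    probs = probs.movedim(1, -1)            # (N, ..., C): align class axis
    # Probability mass placed on semantically distant classes is penalized more.
    penalty = (penalty_rows * probs).sum(dim=-1).mean()
    return ce + beta * penalty
```

Under this form, confusing two classes within the same super-class (say, two bone types) costs less than confusing semantically distant ones (say, gray matter and air), which is the intuition behind the method.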

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_44

SharedIt: https://rdcu.be/cVRyY

Link to the code repository

https://github.com/lab-smile/DOMINO

Link to the dataset(s)

The dataset is currently still private.


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes that deep-learning models calibrated with domain-aware model calibration (DOMINO) are more accurate than models not tuned with DOMINO. To test this, medical image segmentation algorithms were chosen for the analysis, specifically algorithms that perform head segmentation. The DOMINO variant that uses the confusion matrix (UNETR-CM) outperforms both the hierarchical class-based variant (UNETR-HC) and non-calibrated UNETR (UNETR-Base) in most instances; however, both UNETR-CM and UNETR-HC outperform UNETR-Base. Calibrated models also outperform Headreco on all tissue classes except for gray matter and CSF.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is a novel approach to tuning deep learning algorithms that can help increase deep-learning model accuracy.

    2. A very well-organized paper, detailed enough that the methods could be reproduced.

    3. This method could be used across many deep-learning models, even outside of segmentation; DOMINO appears useful across multiple deep-learning domains.

    4. The paper also displays a good understanding of the problem paired with a novel solution.

    5. The authors were able to demonstrate the validity of the DOMINO framework through extensive quantitative testing measures.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    N/A

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. The methods are very detailed. The derivations for the approach are shown, which makes this framework reproducible.

    2. Images are not publicly available; however, all parameters used, such as repetition time (TR), echo time (TE), and field of view (FOV), are listed, which means the study is also reproducible.

    3. The authors do say they will release DOMINO in the future.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Experimental and Results

    1. In the ‘Ground Truth’ section, it is mentioned that 11 tissue types are being used. These 11 tissue types should be listed there as well. Some were listed, but not all. All 11 do appear in the figures, but they should also appear in the text.

    2. In the same section as above, the semi-automated segmentation routine used by the trained staff member should also be mentioned here. This helps to add to reproducibility of the pipeline, especially for training.

    3. In the ‘Evaluation Metrics’ section, where the authors say ‘brain segmentation’, I think they meant to write ‘head segmentation’.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors introduce a novel approach to tuning deep learning algorithms. This approach could help improve model accuracy. The paper is very detailed: it gives a step-by-step account of the approach and the reasoning behind it, and provides a detailed quantitative analysis of tuned and untuned models against a widely used algorithm. This is solid work that can be expanded and applied to many different deep learning algorithms solving problems other than segmentation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors put forward a sensible domain-aware loss function which leverages class similarity and hierarchy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written, very elegant, well structured
    • Experiments are convincing
    • Motivation is quite clear
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • My main concern about this work is that the loss function now depends more on the training data. I wonder whether the performance on another dataset would then be compromised. The authors need to include experiments on other datasets acquired with distinct imaging considerations to show whether this is the case or not.
    • Performance of the network with respect to Headreco: Fig. 5 shows Headreco can segment most regions nicely (except air). I wonder, then, what a model trained on Headreco with the domain-aware loss would have to offer?
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Ok

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Minor points

    • Introduction: “interpretability” is not really addressed, consider removing this term.
    • The way to compute the hierarchical class-based W matrix is unclear. The authors mention “following this formula”, but I could not find any formula.
    • Why is S set to 3? What is the effect of changing this parameter?
    • Section 3.3: It is unclear what the authors mean by “due to its high variability between individuals”. All tissues would vary between individuals, wouldn’t they?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is very interesting, the proposal makes sense, and the manuscript very well written. There are a few items (“marketing” and generalizability) that I think the authors should consider.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is very clearly written and presents an elegant solution to an important problem. Despite a few shortcomings highlighted by the reviewers regarding generalisability, which would be interesting to discuss further, this manuscript would be an interesting addition to the MICCAI conference.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

Q1. In the same section as above, the semi-automated segmentation routine used by the trained staff member should also be mentioned here. This helps to add to reproducibility of the pipeline, especially for training.

The ground-truth section does describe the semi-automated segmentation routine already. We have added additional details about this routine to enhance clarity. It now reads as follows: The semi-automated segmentation consists of three phases: automated segmentation, manual correction, and label refinement. First, base segmentations for the white matter (WM), gray matter (GM), and bone classes were obtained using Headreco, while air was generated in SPM12. Next, all automatically generated labels were manually corrected by domain experts using ScanIP Simpleware™ software (version 2018.12, Synopsys, Inc., Mountain View, USA). In the label-refinement phase, bone was further classified into cancellous and cortical tissue using thresholding and morphological operations. The blood, skin, fat, muscle, and eyes (sclera and lens) were also manually segmented in Simpleware. CSF was generated by subtracting the other ten tissues from the entire head volume. The resulting 11 tissue masks served as the ground truths for learned segmentation.
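
As a rough illustration of the label-refinement phase only (the manual-correction steps cannot be captured in code), the sketch below mimics the CSF-by-subtraction and bone-splitting steps described above. The input names, threshold, and structuring element are hypothetical; the paper does not specify the exact values used in Simpleware.

```python
import numpy as np
from scipy import ndimage

def refine_labels(head_mask, tissue_masks, bone_mask, intensity, thresh=0.5):
    """Sketch of the refinement phase; all parameters are illustrative.

    head_mask: boolean array of the whole head volume;
    tissue_masks: list of boolean arrays for the ten non-CSF tissues;
    bone_mask: boolean array of all bone; intensity: image used to
    threshold bone into denser (cortical) vs. cancellous tissue.
    """
    # CSF = head volume minus the union of the other ten tissues.
    csf = head_mask & ~np.logical_or.reduce(tissue_masks)

    # Split bone by thresholding, then clean up with a morphological opening.
    cortical = bone_mask & (intensity >= thresh)
    cortical = ndimage.binary_opening(cortical, structure=np.ones((3, 3, 3)))
    cancellous = bone_mask & ~cortical
    return csf, cortical, cancellous
```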

Q2. My main concern about this work is that the loss function now depends more on the training data. I wonder whether the performance on another dataset would then be compromised. The authors need to include experiments on other datasets acquired with distinct imaging considerations to show whether this is the case or not.

We do not believe that our method depends on the training data any more than any other method does. However, we will look for another test dataset so that we can compare our performance on data collected with slightly different parameters.

Q3. Performance of network wrt to headreco: Fig 5 shows headreco can segment most regions nicely (except air). I wonder then what would a model trained on headreco with the domain-aware loss would have to offer?

Headreco does not implement deep learning in its segmentation pipeline, and there is currently no clear way to combine our deep learning model with Headreco. Headreco performs segmentation using two other software packages, SPM and CAT. These packages are open-source in principle; however, they run on top of the paid software MATLAB, whereas our domain-aware loss trains a deep learning pipeline in Python. Hence, these are two completely different pipelines. We have updated the text to reflect this limitation of Headreco.

Q4. The way to compute the hierarchical class-based W matrix is unclear. The authors mention “ following this formula”, but I could not find any.

The authors appreciate the reviewer bringing this ambiguity to our attention. Additional context has been added to clarify the generation of the UNETR-HC matrix penalty. The relevant portion now reads as follows: “Table 1 shows the hierarchy for the head segmentation task. We define the matrix penalty shown in Fig. 1a by considering which classes are subsets of the same super-class. In this figure, each row represents the penalties for confusing the given class with any other class. Here, the maximum penalty is 3, and penalties are manually lowered within the hierarchical groups of Table 1. In the case of eyes, we lowered the penalties for similar classes that are spatially close to the eyes. This method of generating the matrix penalty is more subjective than UNETR-CM, but it allows us to incorporate domain knowledge into the loss.”
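
As a concrete illustration of the clarified text, the sketch below builds such a penalty matrix from a class hierarchy. The class names, groupings, and within-group penalty value are placeholders (the real groupings follow Table 1 of the paper, and the eye-specific spatial adjustments are omitted).

```python
import numpy as np

# Illustrative class list and hierarchy; the real groupings are in Table 1.
CLASSES = ["WM", "GM", "CSF", "blood", "cancellous", "cortical",
           "fat", "muscle", "skin", "eyes", "air"]
GROUPS = {  # super-class -> member classes (placeholders)
    "brain": ["WM", "GM", "CSF"],
    "bone": ["cancellous", "cortical"],
    "soft": ["fat", "muscle", "skin"],
}

def hierarchical_penalty(classes, groups, max_penalty=3.0, within=1.0):
    """Build a (C, C) penalty matrix: max_penalty off-diagonal, lowered
    to `within` for pairs sharing a super-class, zero on the diagonal."""
    idx = {c: i for i, c in enumerate(classes)}
    W = np.full((len(classes), len(classes)), max_penalty)
    np.fill_diagonal(W, 0.0)
    for members in groups.values():
        for a in members:
            for b in members:
                if a != b:
                    W[idx[a], idx[b]] = within
    return W

W = hierarchical_penalty(CLASSES, GROUPS)  # feed into the loss sketch above
```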


