
Authors

Skylar E. Stolte, Kyle Volle, Aprinda Indahlastari, Alejandro Albizu, Adam J. Woods, Kevin Brink, Matthew Hale, Ruogu Fang

Abstract

Out-of-distribution (OOD) generalization poses a serious challenge for modern deep learning (DL). OOD data consists of test data that is significantly different from the model’s training data. DL models that perform well on in-domain test data could struggle on OOD data. Overcoming this discrepancy is essential to the reliable deployment of DL. Proper model calibration decreases the number of spurious connections that are made between model features and class outputs. Hence, calibrated DL can improve OOD generalization by only learning features that are truly indicative of the respective classes. Previous work proposed domain-aware model calibration (DOMINO) to improve DL calibration, but it lacks designs for model generalizability to OOD data. In this work, we propose DOMINO++, a dual-guidance and dynamic domain-aware loss regularization focused on OOD generalizability. DOMINO++ integrates expert-guided and data-guided knowledge in its regularization. Unlike DOMINO which imposed a fixed scaling and regularization rate, DOMINO++ designs a dynamic scaling factor and an adaptive regularization rate. Comprehensive evaluations compare DOMINO++ with DOMINO and the baseline model for head tissue segmentation from magnetic resonance images (MRIs) on OOD data. The OOD data consists of synthetic noisy and rotated datasets, as well as real data using a different MRI scanner from a separate site. DOMINO++’s superior performance demonstrates its potential to improve the trustworthy deployment of DL on real clinical data.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_68

SharedIt: https://rdcu.be/dnwEl

Link to the code repository

N/A

Link to the dataset(s)

The dataset is private


Reviews

Review #1

  • Please describe the contribution of the paper

    In this manuscript, the authors built upon the DOMINO method, a calibration strategy for deep neural networks (DNNs) that deploys semantic confusability and hierarchical similarity between classes as a regularization term during training. The main aim of this paper was to introduce a dual-guidance penalty matrix with an adaptive scaling and regularization rate as a domain-aware regularization term to improve the generalizability of the DNN model. They validated this strategy on medical image segmentation using MRI data. They used UNETR, one of the state-of-the-art DNN methods for image segmentation, as the base model to test DOMINO++. Their findings indicate that the DNN model trained with the DOMINO++ method generalizes better than a baseline UNETR and one trained with the DOMINO strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The pipeline described by the authors can positively affect the application of deep neural networks (DNNs) in medical image analysis. Recently, many DNN models have been proposed for different computer vision tasks such as image segmentation, registration, and reconstruction. Investigating and improving the generalizability of DNN methods through adaptable domain-aware model calibration is an important concept that is analyzed in this paper. The proposed method (DOMINO++) has the potential to improve the deployment of DNN methods in the field by making the model learn fewer spurious connections between features and classes and thus generalize well to out-of-distribution datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the main idea of the paper (introducing a domain-aware regularization term in the loss function) is valuable and helps the DNN model generalize better, the training time and computation will be doubled (there are two training stages, as shown in the flowchart for the DOMINO++-HCCM pipeline). The reviewer thinks this needs to be discussed in the paper.

    While the authors described how W_HC and W_HCCM are computed, they did not report how different these two matrices are numerically. The reviewer thinks placing W_HC next to W_HCCM and comparing them would make the method section more intuitive and readable.

    The authors did not explain how task-dependent W_HC and W_HCCM are. In other words, considering the time and computation needed to obtain W_HC and W_HCCM, can the authors discuss how one can determine whether obtaining both DOMINO-HC and DOMINO-CM together is helpful for a given task?

    Following the previous comment, the reviewer thinks the results of the ablation studies are important and help the readers understand the method better. While the results are reported in the supplementary material, there is no explanation in the text. Adding some explanation to the results section and referring readers to the supplementary material would make the paper easier to follow.

    The authors only reported the average of their evaluation metrics (Dice and Hausdorff distance); they did not report the standard deviation (SD) or variance across the test set. The reviewer thinks adding the SD is essential to interpret whether the DNN model was robust across all the image patches or not.

    To compare the methods quantitatively, the authors chose the Dice score, which balances Precision and Sensitivity. However, there is no report of Precision and Sensitivity. The reviewer thinks adding Precision and Recall next to the Dice score would help the readers determine whether the model is actually striking a balance between these two metrics (see the brief illustrative sketch after these comments).

    The authors compared the averages of the evaluation metrics without any statistical test. Reporting statistical significance would help the readers of the paper understand how effective DOMINO++ is in practice.

    The authors did not discuss the sensitivity of the model to the s and β parameters. For example, in Section 2.2, how did the authors arrive at S=10 for epoch 1, S=1 for epoch 2, and S=0.1?
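
    To make the relationship between these metrics concrete: the Dice score is the harmonic mean of precision and sensitivity (recall), so a single Dice value can hide very different precision/recall trade-offs. A minimal, illustrative sketch (not taken from the paper) of computing all three from binary masks:

        import numpy as np

        def precision_recall_dice(pred, target):
            # Precision, recall (sensitivity), and Dice from binary segmentation masks.
            pred, target = pred.astype(bool), target.astype(bool)
            tp = np.logical_and(pred, target).sum()
            fp = np.logical_and(pred, ~target).sum()
            fn = np.logical_and(~pred, target).sum()
            precision = tp / (tp + fp + 1e-8)
            recall = tp / (tp + fn + 1e-8)
            dice = 2 * tp / (2 * tp + fp + fn + 1e-8)  # harmonic mean of precision and recall
            return precision, recall, dice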

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All the parameters and formulas are explained. However, the UNETR hyperparameters, such as the feature size dimensions and the number of attention heads, are not reported. Finally, the hardware information for running the pipeline is also reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) Report the DNN’s performance quantitatively and qualitatively; adding Precision, Sensitivity, and the standard deviation next to the average metrics can help the readers understand how well the method performs. (2) Report how sensitive the algorithm is to its parameters (s and β). (3) Add a statistical comparison between the proposed method and the baseline performance. (4) Investigate / discuss the underlying reasons for the findings (is this regularization term always effective, or does it depend on the task, model, loss function, etc.?).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting work, needs some improvement in terms of methodology (specifically evaluation)

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Submission 3114 extends the existing DOMINO (domain-aware model calibration for medical image segmentation) with a dynamic framework for regularising the model. The proposed extension is demonstrated to improve performance by a significant margin (ca. 2%), similar to the margin of the original DOMINO over the baseline.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Beyond the SOTA performance (in comparison to the baseline and the original DOMINO), the concept of calibration is interesting for the biomedical field, in particular for clinical applications.
    • The calibration approach can be applied to a wide range of deep methods.
    • The paper is well written and clearly presented.
    • Training, validation and testing are clearly presented.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The novelty is limited due to it being an extension to an existing approach (nonetheless, the achieved performance increase warrants communication of the results).
    • The experiments are performed on a single MRI dataset.
    • The utility of the approach is demonstrated with a single method (UNETR).
    • In comparison to the baseline, the method’s improvement on out-of-distribution segmentation is about the same as its improvement on in-distribution segmentation (e.g., comparing the baseline and DOMINO++ on Site A clean versus Site B, the increments are very similar: a delta of 0.0345 vs. 0.0355). There is no discussion of this. Does this call the improved generalisability into question?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility is expected to be good:

    • Code is available publicly
    • Dataset is private (explained in the paper)
    • Documentation of the method is thorough, clear dataset splits etc. are indicated
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Demonstrating the DOMINO++ on a wider range of methods and datasets would be more convincing with regard to the general applicability of the approach.
    • Consider adding a brief overview of the key contributions (e.g., as in Section 2.2) to the end of the introduction.
    • I encourage the authors to also make the data available.
    • Add a discussion of the fact that, in comparison to the baseline, the proposed method’s improvement on out-of-distribution segmentation is about the same as on in-distribution segmentation (see point 6 above). Does this call the improved generalisability into question?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the experimentation is limited, the calibration approach is interesting to medical segmentation and the improved results in comparison to the original method and the baseline warrant communication with the MICCAI community. The paper is also very well written and clearly presented.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    While the rebuttal addresses some of my concerns, the somewhat limited novelty (extension to existing method), limited experiments (single dataset) and limited demonstration of utility (single method) remain to a certain extent. Despite these limitations, this is an interesting paper that is very clearly presented. I maintain my original justification that this paper will be of interest to the MICCAI community.



Review #4

  • Please describe the contribution of the paper

    The article aims to provide a method (DOMINO++) that calibrates deep learning models during training and is able to handle OOD data at inference time (different vendors, rotations, etc.).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article provides an improved calibration method (incremental, based on previous contributions) that improves OOD handling at inference time. The novelty lies in an adaptive scheme and flexible hyperparameter selection that change depending on the training epoch.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The article provides a comparison with a baseline segmentation network that does not incorporate any standard regularization techniques such as data augmentation, which can help with OOD data at inference time. Further, testing is performed only on the basis of DSC and HD, which greatly limits the evaluation of the work. In addition, the comparisons are limited, since the authors only show improvement with respect to the previous version of the methodology and a vanilla architecture (Transformer-based). The improvements are marginal when compared to the previous version of the methodology, and it is unclear whether they can have any real implication for clinical deployment.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors will release code if accepted. Other than that, the rest seems to comply with conference standards.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    First of all, thanks to the authors for the submission and the effort that they have put into the work. I would like to mention that I found the work interesting, and calibration and OOD detection are definitely relevant applications to ensure safe deployment of DL-based software. In spite of this, I have several concerns. The authors present an “incremental” work (based on the previously developed DOMINO) and an evaluation based on it. My first concern is that the only comparison is between a baseline trained without any regularization (i.e. data augmentation), the previous version of the proposed approach, and the new one. I miss a more in-depth evaluation considering at least one other methodology to handle OOD data and calibrate the model. Further, comparisons are established on the basis of HD and DSC, which limits the work and the understanding of the method. What about volumetric metrics? Volume difference is usually a metric of interest. Keep it going!

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    An interesting article on calibration which improves over previous iterations of the algorithm. In spite of the remarked weaknesses, I believe the methodology and experiments shown in previous iterations can compensate for them.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents an extension to the existing DOMINO (domain-aware model calibration for medical image segmentation). There was a consensus among the reviewers that the work is interesting, the paper is well written, and the results are convincing. There were, however, two major concerns related to the method’s novelty and its applicability to different datasets, as it was tested on a single one. The reviewers also raised some important questions we feel should be addressed in a rebuttal.




Author Feedback

We thank all reviewers for their careful review and constructive feedback. We also considered many related questions while writing, but space was limited. The other suggestions are very helpful for improving the quality of our future work.

  1. The consistency of improvement using our proposed regularization term across different tasks, models, and loss functions. Using one dataset (MRI) and one model (UNETR) is limiting.

We agree that the discussion of the regularization term across different tasks is important; however, space was limited by the eight-page maximum. We focused on replicating the original DOMINO study as closely as possible so that we could strictly compare the differences between DOMINO and DOMINO++.

  2. The authors should place W_HC and W_HCCM next to one another for a clearer comparison. The numerical differences and computation costs between W_HC and W_HCCM should be explained. The training time could increase with the DOMINO++-HCCM method.

We do not show the DOMINO-HC matrix because we use the exact HC matrix from the DOMINO paper. We reference this in Section 2.2, DOMINO++ Loss Regularization, Subsection “Combining expert-guided and data-guided regularization”. DOMINO-HC has a symmetrical, block-like structure, since W_HC is constructed from manual hierarchical groupings. DOMINO-HCCM varies more in its specific matrix entries, because it learns from both the hierarchical groupings and intuition from the data. DOMINO-HCCM does take twice the training time of DOMINO-HC. You would choose DOMINO-HCCM over DOMINO-HC if you have rough (imperfect) hierarchical groupings for your data. In this case, you learn from the rough matrix during the first training and tune the results using your data-driven findings.
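
For readers unfamiliar with these matrices, a rough, hypothetical sketch of the block structure described above is given below; the exact construction, values, and normalization of W_HC and W_HCCM follow the DOMINO paper and are not reproduced here.

    import numpy as np

    def hierarchical_penalty(groups, n_classes, in_group=0.5, out_group=1.0):
        # Hypothetical block-structured penalty matrix in the spirit of W_HC:
        # confusing two classes from the same expert-defined group is penalized
        # less than confusing classes from different groups. The in_group and
        # out_group values here are placeholders, not the authors' values.
        w = np.full((n_classes, n_classes), out_group)
        for group in groups:
            for i in group:
                for j in group:
                    w[i, j] = in_group
        np.fill_diagonal(w, 0.0)  # no penalty for correct predictions
        return w

    # Hypothetical example: 6 tissue classes split into 2 expert-defined groups.
    # W_HCCM would additionally fold in a confusion matrix estimated from a first
    # training run, which is what doubles the training cost mentioned above.
    w_hc = hierarchical_penalty(groups=[[0, 1, 2], [3, 4, 5]], n_classes=6)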

  3. Include an analysis of the ablation studies.

We will add a section on the ablation studies in our camera-ready version.

  4. Evaluation metrics: only Dice and Hausdorff Distance were used. Precision, Sensitivity, and Volume Difference were suggested by reviewers. Add standard deviation (SD) to all metrics.

We chose the Dice Score and Hausdorff Distance as our metrics because they are two well-validated metrics for image segmentation, and our page space is limited. We will include the SD in the final version.

  5. Sensitivity of the model to the s and β parameters.

We did not discuss the sensitivity to specific scaling terms or regularization weightings since they both change over epochs. The scaling factor S is calculated to be the closest number on the logarithmic (log base 10) scale to the current epoch’s standard loss. The closest numbers to L=13, L=1.5, and L=0.1 are 10^1, 10^0, and 10^-1, respectively. The supplementary material helps show the model’s sensitivity to including these parameters.
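
As a concrete illustration of this rule (a minimal sketch; the function and variable names are ours, not the authors’):

    import math

    def dynamic_scale(standard_loss: float) -> float:
        # Nearest power of ten (on the log10 scale) to the current epoch's
        # standard loss, as described in the rebuttal.
        return 10.0 ** round(math.log10(standard_loss))

    # Examples from the rebuttal:
    #   dynamic_scale(13.0) -> 10.0  (10^1)
    #   dynamic_scale(1.5)  -> 1.0   (10^0)
    #   dynamic_scale(0.1)  -> 0.1   (10^-1)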

  6. The incremental improvement in the delta between our baseline model and DOMINO++.

We believe that the improvement in the Site B results is still worth including. Handling data from multiple scanners is non-trivial in deep learning for medical imaging. We found that the relative difference in DOMINO++’s performance on clean Site A data versus rotated Site A data was smaller than the baseline’s performance difference between these two datasets (6.5% for DOMINO++ versus 10.6% for the baseline). This suggests that our model may currently be the best equipped to handle MRI motion issues. Our future work will focus on greater consistency across MRI scanners.

  7. DOMINO++ was only compared with a baseline model without regularization (e.g., data augmentation) and DOMINO.

Here, the “unregularized” baseline model refers only to the absence of the DOMINO regularization term; we did apply data augmentation strategies in our baseline model. These augmentations include random rotations along each axis and additive Gaussian noise (mean=0, standard deviation=0.1). However, we understand that omitting this information did not help our paper’s clarity. We would be happy to edit the explanation of our baseline model in the camera-ready paper.
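
For illustration only, augmentations of this kind could be expressed with MONAI transforms roughly as follows; the probabilities and rotation ranges are assumptions on our part, and only the noise parameters (mean 0, std 0.1) come from the rebuttal.

    from monai.transforms import Compose, RandGaussianNoised, RandRotated

    # Sketch of the baseline augmentation described in the rebuttal:
    # random rotations about each axis plus additive Gaussian noise.
    train_aug = Compose([
        RandRotated(keys=["image", "label"], range_x=0.3, range_y=0.3, range_z=0.3,
                    prob=0.5, mode=("bilinear", "nearest")),
        RandGaussianNoised(keys=["image"], prob=0.5, mean=0.0, std=0.1),
    ])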




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors addressed most of the reviewers’ concerns. The work is interesting, and its merits outweigh its weaknesses.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Although the work is of limited novelty and was performed on limited datasets (and I do not think that the rebuttal addressed the major concerns), all reviewers still had positive comments on the quality of the paper/topic and agreed that it would be an interesting work to discuss in the MICCAI forum.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The overall recommendation from the reviewers is to accept the work, with no reviewer recommending rejection. These reviews are reasonable. However, the response to the reviews in the rebuttal seems to lack effort. I would strongly recommend that the authors implement the requested changes should the paper be accepted.


