
Authors

Jonathan Lennartz, Thomas Schultz

Abstract

Domain shift occurs when training U-Nets for medical image segmentation with images from one device, but applying them to images from a different device. This often reduces accuracy, and it poses a challenge for uncertainty quantification, when incorrect segmentations are produced with high confidence. Recent work proposed to detect such failure cases via anomalies in feature space: Activation patterns that deviate from those observed during training are taken as an indication that the input is not handled well by the network, and its output should not be trusted. However, such latent space distances primarily detect whether images are from different scanners, not whether they are correctly segmented. Therefore, we propose a novel segmentation distortion measure for uncertainty quantification. It is based on using an autoencoder to make activations more similar to those that were observed during training, and propagating the result through the remainder of the U-Net. We demonstrate that the extent to which this affects the segmentation correlates much more strongly with segmentation errors than distances in activation space, and that it quantifies uncertainty under domain shift better than entropy in the U-Net’s output.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_31

SharedIt: https://rdcu.be/dnwBr

Link to the code repository

https://github.com/MedVisBonn/Segmentation-Distortion/

Link to the dataset(s)

https://portal.conp.ca/dataset?id=projects/calgary-campinas

https://humanheart-project.creatis.insa-lyon.fr/database/#collection/637218c173e9f0047faa00fb

https://www.ub.edu/mnms/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a “segmentation distortion” method for uncertainty estimation. Without modifying the architecture, they train an autoencoder on network features. The autoencoder is trained both to reconstruct the activations well and to have minimal impact on the segmentation. At test time, the reconstructed features are propagated through the network, and the impact of passing through the autoencoder on the segmentation mask is measured. The method is based on the belief that only features that can be reconstructed accurately will produce high-quality segmentations.
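
To make the described test-time procedure concrete, here is a minimal numerical sketch (not the authors' implementation): the `softmax` helper, the toy H x W x C logit shapes, and the mean-squared aggregation over pixels and classes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the class axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segmentation_distortion(logits_orig, logits_ae):
    """Hypothetical aggregation: mean squared difference between the
    softmax outputs of the original and the AE-reconstructed path."""
    p, q = softmax(logits_orig), softmax(logits_ae)
    return float(np.mean((p - q) ** 2))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8, 3))  # toy H x W x C logits
# identical activations -> zero distortion; a perturbed path (stand-in for
# the AE-reconstructed features) -> positive score
zero = segmentation_distortion(logits, logits)
score = segmentation_distortion(logits, logits + rng.normal(scale=0.5, size=logits.shape))
```

Identical activations yield exactly zero distortion, while any change introduced by the reconstruction yields a positive score; the paper's actual aggregation may differ from this sketch.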

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Overall, I really like the method. Unlike previous work that considers the distance of test activations to the training data as a measure of uncertainty, the proposed approach directly assesses how this impacts the segmentation performance. As expected, this allows for better calibration.
    • As the autoencoder is only used to reconstruct network features, the additional computational overhead is minimal. Additionally, no change needs to be made to the model architecture or training procedure. The method also requires no access to OOD data during training.
    • Figure 1 perfectly captures the core of the method.
    • There are comparisons to several established feature-based uncertainty estimation methods, and one ablation (LM).
    • The authors appropriately explore the limitations of their method in the OOD detection task (section 4.3).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors allude to the fact that they tested denoising and variational autoencoders, but these led to worse results. These results should at least be included in the supplementary material.
    • The decision to reconstruct low-dimensional features at the end of the encoder is reasonable. However, the authors state that doing otherwise would lead to blurring. This should ideally be backed by qualitative results, for instance in the supplementary material. The same goes for the weighting between the feature reconstruction and segmentation preservation losses.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I believe the findings are reproducible. The code will be made publicly available and the data already is. The authors also include download links.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • When citing multiple works together, please order these according to their place in the references.
    • The datasets used are first named in section 3.1, before they are properly introduced or referenced. I would suggest at least briefly describing each dataset before mentioning it. Additionally, the dimensions of the bottleneck need not appear at the top of the Methodology, which should not include such details of the experimental setting.
    • Please describe better what is meant by “high-density regions”.
    • Please clarify the statement “We train on the same data the U-Net was trained on” on page 4.
    • One could argue that cropping images to a uniform shape to accommodate the AE (section 3.3) implies a change in the training procedure. Please explain how the method would handle cases where the data contains images of different dimensions.
    • In the caption of Figure 2 you state “mean (surface) Dice”. This is the only place in the text where surface Dice is mentioned, and it is a different metric from regular Dice. Do you report the regular Dice coefficient or surface Dice?
    • In Figure 3, I do not understand why in the left plot we see results for both Siemens and Siemens (val), and in the right plot, we only see results for Siemens 3.0 (val). Please explain what “Siemens” is in the left plot in the caption.
    • The boxplots in Figure 4 (left) are a bit difficult to follow; consider including gridlines. Also, accuracy is not the most suitable metric for OOD detection; consider using AUROC instead.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend accepting this paper as the proposed method is innovative yet simple and has multiple advantages, namely that it does not require changing the architecture or training, needs no OOD data, and obtains better-calibrated results than existing work. I believe the results are sufficiently thorough for a conference paper, though I make a few suggestions for additional experiments that could go in the supplementary material.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a segmentation uncertainty measure with the goal of assessing segmentation correctness rather than detecting OOD inputs. The method uses an autoencoder in the bottleneck layer of a U-Net. The rationale is that under domain shifts, the AE-reconstructed bottleneck layer could lead to a different segmentation than the original bottleneck layer. The authors argue that such differences may indicate segmentation errors. I do not really understand the mechanism behind this, but the experimental results support the authors’ claims.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written
    • The introduction carries a clear and simple motivation: uncertainty estimation should not simply estimate “OOD-ness”, but incorrectness. The problem the authors address is well-defined and worthwhile.
    • The paper proposes a creative solution, although, as mentioned above, I do not fully understand why it works.
    • It does work remarkably well though.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I would appreciate some analysis of the effect of the auto-encoder on the downstream segmentation, which is currently missing. Some important questions would be: What is the effect of the AE-reconstructions? How do they influence the segmentation? Are there any implications for robustness? What is the segmentation performance when using the original vs. the reconstructed bottleneck?

    • There are quite a lot of design choices (in which layer the auto-encoder is applied, reconstruction loss, auto-encoder type, loss balancing factor): Were these hyperparameters selected based on a validation set? Otherwise I would be a bit worried and would at least expect these comparisons to be reported.

    • The proposed uncertainty measure (“Segmentation Distortion (SD)”) seems quite heuristic, although it appears closely related to the Brier Score (maybe the authors could elaborate on this). Furthermore, the proposed measure is global (which is desirable), but it would be extremely interesting to also evaluate / visualise the difference in segmentations locally. At present, no such analysis (even qualitative) is provided.
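
The reviewer's remark about the Brier score can be made concrete with a small sketch: both are quadratic discrepancies between probability maps, but the Brier score compares against one-hot ground truth, whereas segmentation distortion (under an assumed mean-squared aggregation, which may not match the paper's exact definition) compares against the AE-reconstructed prediction and therefore needs no labels at test time.

```python
import numpy as np

def brier_score(p, y_onehot):
    """Brier score: mean squared error between predicted class
    probabilities and one-hot ground truth."""
    return float(np.mean((p - y_onehot) ** 2))

def segmentation_distortion(p, p_ae):
    """Same quadratic form, but the one-hot labels are replaced by the
    probabilities obtained after routing features through the AE
    (hypothetical aggregation)."""
    return float(np.mean((p - p_ae) ** 2))

p    = np.array([[0.8, 0.2], [0.3, 0.7]])    # per-pixel class probabilities
y    = np.array([[1.0, 0.0], [0.0, 1.0]])    # ground truth (one-hot)
p_ae = np.array([[0.6, 0.4], [0.35, 0.65]])  # AE-reconstructed prediction

b  = brier_score(p, y)             # needs labels
sd = segmentation_distortion(p, p_ae)  # label-free
```

The shared quadratic form is what suggests the relation; the key practical difference is that SD replaces the unavailable ground truth with the AE-propagated prediction.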

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The authors stated that they report on the sensitivity regarding hyperparameters. However, this was not reported w.r.t. the major design choices (see “Weaknesses”).
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The figures could be improved: Fig. 1 does not use space very wisely, and some figures are of fairly low quality (esp. Fig. 4).

    • The mathematical notation seems a bit unclean, as U(I) indicates that U is a mapping from I to the segmentation output, which conflicts with the composition U \circ r(x).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an interesting and innovative concept, but I would have appreciated further analysis of its core behaviour.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a novel autoencoder-based image-level uncertainty estimation method, and the authors investigate the proposed uncertainty measure against different types of domain shift. The proposed method estimates segmentation failure better than traditional OOD detection methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work proposes a novel strategy that measures segmentation distortion when an intermediate latent feature of the network is harmonized by an autoencoder.

    • The authors discuss the cost of a conservative OOD detection strategy, namely that it may reject images with rather good segmentation results.

    • The proposed method achieves a better confidence-Dice correlation than the OOD detector (PM) used for comparison.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The literature review is incomplete: the mean entropy of Bayesian U-Net outputs (e.g., from deep ensembles or MC-dropout) is not discussed, although it is more reliable than the mean entropy of a single U-Net’s output.

    • The correlation between image-level uncertainty and the Dice score is presented, but the correlation coefficient can be misleading when the uncertainty-Dice scatter plot is not shown.

    • The presentation needs to be improved; the writing of this paper is too colloquial.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work should be reproducible as the authors will make the code available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It would be interesting to compare the proposed method to the mean entropy of Bayesian U-Net outputs.

    • It would be more clinically beneficial to measure the difference between the estimated confidence (1 - uncertainty) and the Dice score than to evaluate only the correlation coefficient.
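
The reviewer's point can be illustrated with a toy example: Pearson correlation is invariant to constant offsets, so confidence can correlate perfectly with Dice while still being systematically miscalibrated. The `calibration_gap` helper below is hypothetical, not a metric from the paper.

```python
import numpy as np

def pearson_r(a, b):
    # Pearson correlation coefficient between two 1-D sequences
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.corrcoef(a, b)[0, 1])

def calibration_gap(confidence, dice):
    """Mean absolute difference between estimated confidence and Dice."""
    return float(np.mean(np.abs(np.asarray(confidence) - np.asarray(dice))))

conf = [0.9, 0.7, 0.5, 0.3]
dice = [0.6, 0.4, 0.2, 0.0]  # perfectly correlated, but offset by 0.3

assert pearson_r(conf, dice) > 0.999          # correlation looks ideal
assert abs(calibration_gap(conf, dice) - 0.3) < 1e-9  # yet confidence is 0.3 too high
```

A perfect correlation coefficient thus hides a constant over-confidence, which is exactly what a direct confidence-vs-Dice comparison would reveal.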

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel method for image-level uncertainty estimation and offers a valuable discussion of the trade-off between rejecting more images and achieving reliable predictions. However, the presentation of the paper needs to be improved.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Based on the three reviews, there is consensus that the proposed method for uncertainty estimation using an autoencoder is interesting and innovative. Reviewers appreciate the simplicity of the method, its ability to improve calibration, and the thoroughness of the experimental evaluation. However, there are also some concerns raised by reviewers regarding the lack of analysis of the effect of the autoencoder on the downstream segmentation, the need for further comparison with other uncertainty estimation methods, and the presentation of the paper. Despite these weaknesses, the paper has received overall positive reviews, and reviewers have recommended acceptance of the paper with minor revisions.

    Therefore, based on the strengths of the paper and the reviewers’ feedback, I recommend accepting the paper with minor revisions. The authors should address the concerns raised by the reviewers, particularly regarding the analysis of the effect of the autoencoder on the downstream segmentation and the comparison with other uncertainty estimation methods. Additionally, the authors should revise the presentation of the paper to improve clarity and organization. Overall, the proposed method is innovative and has the potential for real-world impact, making it a valuable contribution to the field of medical image segmentation.




Author Feedback

N/A


