
Authors

Mark S. Graham, Walter Hugo Lopez Pinaya, Paul Wright, Petru-Daniel Tudosiu, Yee H. Mah, James T. Teo, H. Rolf Jäger, David Werring, Parashkev Nachev, Sébastien Ourselin, M. Jorge Cardoso

Abstract

Methods for out-of-distribution (OOD) detection that scale to 3D data are crucial components of any real-world clinical deep learning system. Classic denoising diffusion probabilistic models (DDPMs) have been recently proposed as a robust way to perform reconstruction-based OOD detection on 2D datasets, but do not trivially scale to 3D data. In this work, we propose to use Latent Diffusion Models (LDMs), which enable the scaling of DDPMs to high-resolution 3D medical data. We validate the proposed approach on near- and far-OOD datasets and compare it to a recently proposed, 3D-enabled approach using Latent Transformer Models (LTMs). Not only does the proposed LDM-based approach achieve statistically significant better performance, it also shows less sensitivity to the underlying latent representation, more favourable memory scaling, and produces better spatial anomaly maps. Code is available at https://github.com/marksgraham/ddpm-ood.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_43

SharedIt: https://rdcu.be/dnwdp

Link to the code repository

https://github.com/marksgraham/ddpm-ood

https://github.com/marksgraham/transformer-ood

Link to the dataset(s)

http://medicaldecathlon.com/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a method for detecting far- and near-out-of-distribution (OOD) 3D images, based on diffusion models (DMs) applied on a discretized autoencoder latent space. The proposed architecture leverages the acknowledged OOD detection performance of DMs based on reconstruction error maps in the image space, and the OOD detection performance of latent transformer models (LTMs) in the latent space. The model first projects the input 3D image into the latent space of a VQ-GAN, then learns to denoise this compressed representation with a denoising diffusion probabilistic model, before decoding it. OOD detection is finally performed in the original image space based on standard reconstruction error. The method is evaluated on near- and far-OOD tasks and compared to the performance of LTMs.
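For readers less familiar with reconstruction-based OOD scoring, the final step described above (scoring in the original image space by standard reconstruction error) can be sketched as follows. This is a toy numpy illustration with stand-in reconstructions, not the authors' implementation; `reconstruction_score` and the simulated volumes are hypothetical.

```python
import numpy as np

def reconstruction_score(volume, reconstruction):
    """Per-voxel squared-error map plus a volume-level OOD score (mean error)."""
    err_map = (volume - reconstruction) ** 2
    return err_map, float(err_map.mean())

# Toy stand-ins for the encode -> denoise -> decode pipeline: an ID volume is
# reconstructed almost perfectly, while the model fails to reproduce OOD content.
rng = np.random.default_rng(0)
id_vol = rng.normal(size=(8, 8, 8))
ood_vol = rng.normal(loc=3.0, size=(8, 8, 8))

id_recon = id_vol + rng.normal(scale=0.05, size=id_vol.shape)   # near-perfect
ood_recon = rng.normal(size=ood_vol.shape)                      # poor reconstruction

_, id_score = reconstruction_score(id_vol, id_recon)
_, ood_score = reconstruction_score(ood_vol, ood_recon)
```

The volume-level score separates the two cases, and the per-voxel error map is what yields the spatial anomaly maps mentioned in the abstract.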

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - OOD detection is a timely topic, especially to explain silent failures of DL diagnostic and prognostic models.
    - The paper is well written and organized. The justification of the proposed architecture is well argued.
    - The authors perform a well-conducted comparative analysis with LTMs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I suggest improving the performance analysis to strengthen the soundness of the study. First, the proposed analysis is restricted to a comparison with LTM models, while other architectures, such as the one proposed by González et al. (MedIA 2022), are not even referenced. Please consider including them in the evaluation, and/or discuss them. Second, performance evaluation consists of a binary classification task of ID and OOD images based on the derived anomaly score map. It would be very interesting to design more challenging diagnostic tasks, e.g. segmentation, to evaluate the added value of the derived OOD score in detecting silent failures.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mention that they will make code available upon acceptance. This would indeed increase reproducibility of the experiments and somehow counterbalance the fact that performance is evaluated on a private dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please consider addressing the points listed in section 6.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting topic and well written paper but the paper lacks performance analysis on a more challenging task to demonstrate the added value of the method.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes an out-of-distribution (OOD) detection method based on a latent diffusion model (LDM), which can be scaled to 3D medical volumes because of its smaller memory footprint. The basic idea is that a diffusion model will only successfully denoise in-distribution data. The authors compare the LDM method with the previous state-of-the-art latent transformer models (LTMs) in far-OOD and near-OOD settings, and the results show the LDM has better performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • This paper gives a thorough analysis of previous LTM methods, including their limitations, and then explains how the proposed method addresses them.
    • This paper shows results on both far-OOD and near-OOD, which is an interesting way of evaluating unsupervised OOD detection.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Limited novelty. Many key designs in this work, like utilizing a diffusion model to reconstruct images for OOD detection and combining MSE and perceptual similarity as the measure of similarity, are the same as in “Denoising diffusion models for out-of-distribution detection” by Graham, M.S., et al., 2022. Replacing the DDPM with a latent diffusion model and extending 2D to 3D are just incremental works.
    • Weak motivation. Considering the complex workflow of the LDM, long inference time, and high GPU memory usage, the application of this 3D OOD detection method would be limited. For example, in a clinical setting, we would wait around 30 seconds just to know if the input volume is out of distribution. I believe even less experienced doctors can tell if it is OOD within 2 seconds. Additionally, is it necessary to conduct OOD detection on the entire volume? If one volume is OOD, every slice of the volume should also be OOD. Then the 3D OOD problem can be easily converted to a 2D OOD problem.
    • Deficiency in experiment design. The only baseline used for comparison in Table 1 is the LTM, and I believe it is necessary to include more for better evaluation. Also, I’m confused by the choice of tasks in near-OOD, like chunk top, chunk middle, and skull stripped; are these tasks meaningful for clinical practice?

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The latent transformer model has open-source code, and this paper also gives details on the model architecture and hyperparameters. Therefore, reimplementing the proposed method is not very difficult. However, part of the dataset is not public, and I’m not sure if the authors will release the dataset in the future.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    • Work [10] shows the similarity between reconstructed images and original images is highly impacted by the noise level t; it would be better if the authors could include an ablation study on t and explain the choice of t values [100, 200, 300, 400].
    • Section 2.3 reads more like an experimental setup, and the paper could be organized better by moving it to Section 4.
    • For Section 4, Results & Discussion, the first paragraph is really long. To make it more readable, I suggest that the authors add numbering for the key observations or split it into different paragraphs. Some sentences could be paraphrased for better clarity:

    1. In section 2.2, “Prior works have shown the performance becomes dependent on the choice of the bottleneck - too small and even ID inputs are poorly reconstructed, too large and OOD inputs are well reconstructed”
    2. In title of Table 1. “Tests for difference in AUC compare each LTM and LDM models with the same VQ-GAN base”
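As context for the reviewer's first point: in the multi-noise-level scheme of [10], the input is noised to each starting step t with the standard DDPM forward process, reconstructed, and the errors averaged over t. A minimal sketch of that loop is below; `fake_denoise` is a hypothetical stand-in for the trained reverse process, not the authors' model.

```python
import numpy as np

def ddpm_forward(x0, t, alphas_cumprod, rng):
    """Standard DDPM forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    a_bar = alphas_cumprod[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Linear beta schedule, as in the original DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4, 4))  # toy (latent) volume

def fake_denoise(xt, t):
    # Placeholder: a real reverse process would iteratively denoise xt back to t=0.
    return x0 + 0.01 * (t / T) * rng.normal(size=x0.shape)

# Noise to each starting t, reconstruct, and average the reconstruction errors.
scores = []
for t in [100, 200, 300, 400]:
    xt = ddpm_forward(x0, t, alphas_cumprod, rng)
    recon = fake_denoise(xt, t)
    scores.append(float(((recon - x0) ** 2).mean()))
ood_score = float(np.mean(scores))
```

Larger t destroys more of the input, so reconstructions from larger t probe the model's prior more strongly; averaging over several t is what makes the score's dependence on any single t less severe, which is exactly why an ablation over t would be informative.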
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Using a latent diffusion model for 3D OOD is incremental relative to previous works, and the authors do not give convincing motivations for 3D OOD given its high computational cost and latency, while 2D OOD could be a better alternative. In addition, inadequate evaluation also leads me to this decision: when the model is 4-level, the LDM is better than the LTM on only two tasks with statistical significance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision

    After reviewing the authors’ feedback, I am further convinced that this work lacks the critical study needed to demonstrate why simply using a subset of 2D slices for near-OOD is not a good solution. Although the authors argue that some near-OOD tasks cannot be converted to 2D slices, all far-OOD and near-OOD tasks in Table 1 could be handled in 2D. Apparently, treating these tasks as 2D slices would greatly reduce training time and inference time. Lacking such a critical comparison is not acceptable.

    On the other hand, in terms of latency, I don’t agree that the later stages could take 10 or more minutes; for example, many classification models take less than 5 s to produce a prediction for a 3D volume.

    Novelty: “Denoising diffusion models for out-of-distribution detection” by Graham, M.S., et al, 2022. already explored DDPM for OOD detection.

    Experiments: The experimental results in Table 1 lack other evaluation metrics and fail to show that the LDM is better than the LTM in the 4-level setting. Ablation studies are also missing.



Review #6

  • Please describe the contribution of the paper

    The paper uses the VQ-GAN + latent diffusion models to detect OOD samples in 3D medical data, showcasing improved results compared to its competitors.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The usage of LDM for OOD detection with 3D data is novel
    • The experiments showcase an impressive improvement over LTM both qualitatively and quantitatively
    • Paper is easy to understand
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The issues with 3D data in current DDPMs should be explained, as it is otherwise a bit hard to completely identify the authors’ contributions.
    • Visual results on far-OOD should be provided to allow comparison with LTMs.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper was easy to understand and appreciate. The concept of using an LDM for OOD detection is pretty good, and the way it is used is novel.

    Here are some of the minor gripes I have with the paper

    • Self-containment: This paper pulls inspiration from other related works, and cites them while using their assertions. I think the paper would benefit from explaining some of the major ones that establish the motivation. For example, (i) please describe why DDPMs are non-trivial to apply to 3D data, and (ii) what was used for the perceptual loss (VGG?), etc.
    • Section 2.2 could use more clarification, as that is the novel contribution. Please indicate in more detail how the z-score was computed: were there different statistics for the MSE and perceptual loss, or some normalization of the losses? Maybe include this in the supplementary material.
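One common way to combine two reconstruction metrics into a single score, which may or may not match the authors' exact procedure (hence the request above), is to z-score each metric against its statistics on an in-distribution validation set and sum the results. All numbers and names below are hypothetical:

```python
import numpy as np

def combined_z_score(mse, lpips, val_mse, val_lpips):
    """Normalise each similarity metric by its mean/std on an ID validation
    set, then sum the z-scores into a single OOD score (higher = more OOD)."""
    z_mse = (mse - np.mean(val_mse)) / np.std(val_mse)
    z_lpips = (lpips - np.mean(val_lpips)) / np.std(val_lpips)
    return float(z_mse + z_lpips)

# Hypothetical validation-set statistics and one test case.
val_mse = np.array([0.010, 0.012, 0.011, 0.009])
val_lpips = np.array([0.20, 0.22, 0.21, 0.19])
score = combined_z_score(mse=0.05, lpips=0.40, val_mse=val_mse, val_lpips=val_lpips)
```

The per-metric normalisation puts the MSE and perceptual loss on a common scale before summing, which is the kind of detail the review asks the authors to spell out.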
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel and well-reasoned usage of ML models on OOD detection task. The experiments are described well and showcase very clear improvements.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Considering the rebuttal and rest of the reviews I stand by my initial assessment.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a method for detecting out-of-distribution (OOD) 3D images using a combination of a latent diffusion model and a discretized autoencoder latent space. The paper evaluates the method on both far- and near-OOD tasks and compares its performance to that of LTMs. The results show that the proposed method outperforms the LTM on most tasks. However, the paper lacks performance analysis on more challenging tasks, which would demonstrate the added value of the method. The scalability of diffusion models to 3D data is not explained well enough to gauge the paper’s contribution. Concerns about the work’s novelty should also be addressed in the rebuttal. Additionally, the high computational cost and latency of the method are not convincingly justified, and the choice of tasks for near-OOD evaluation is questionable. Nonetheless, the paper presents a well-written and organized analysis with clear and easy-to-understand experimental results.




Author Feedback

Comparison to baselines (R1&2): We acknowledge it was not well explained why we only compared to LTM models. Our method is both 3D and unsupervised, meaning it does not need segmentation or classification labels for training. This allows our method to serve as an independent first step in a pipeline, before further processing is performed. There is little published work that is both 3D and unsupervised. For instance, the González paper [1] mentioned by R1 is excellent work but performs OOD as a supervised task requiring segmentation labels - a different set-up to that considered by us. Based on our literature review, we consider the LTM paper the only suitable fully 3D, unsupervised baseline. The only other potential baseline we’re aware of is using the reconstruction error from an AutoEncoder. However, this has been shown to fail for OOD in both 2D [2] and 3D [5], and we found it fails on our problem, too. We will update the paper to explain the lack of other suitable baselines, with references to adjacent works such as [1].

Limited novelty (R2&3): Reviewers suggested that given the method has been shown to work in 2D [2], its extension to 3D is trivial. It is not trivial, as simply scaling up a pixel-space DDPM to 3D is not possible – even extending DDPMs to high-resolution 2D data is an active area of research [3,4], and we have found that even on low-resolution 3D data, pixel-space DDPMs produce poor samples that would make reconstruction-based OOD impossible. To get the method to work in 3D it was necessary to move from a pixel-space DDPM to a two-stage LDM, involving training a first stage VQ-GAN and then a 3D, latent-space DDPM. As additional contributions, we demonstrate the method works in the near-OOD case, while [2] was only tested on far-OOD data, and show the method can provide high-quality anomaly maps, something not shown in [2].

Runtime/cost (R2): We’re unsure why R2 believes our method has high GPU memory usage – Table 1 shows it uses 1.5 GB of GPU memory, easily achievable on a consumer-grade GPU, and half that used by the competing method. Regarding runtime: we see the use of OOD methods as the first step in clinical pipelines that run without human intervention, where later stages might involve a range of processing such as registration, segmentation, classification, and report generation, and could conceivably take >10 minutes to run. In this context, we believe a 30 s OOD detection stage is justified, especially given that flagging a volume as OOD could save on computation by cancelling further analysis.

Necessity of operating in 3D (R2): R2 suggests the 3D problem can be easily converted to 2D by checking a subset of 2D slices. This isn’t the case for near-OOD data, where it may be impossible to tell if a slice is OOD without viewing adjacent slices (with e.g. missing slices, FoV artefacts). We would argue that treating 3D medical data as a set of 2D slices is always going to have drawbacks, and it is a strength of our method that it works in 3D.

Appropriateness of near-OOD data (R2): Most papers only evaluate in the far-OOD domain, because evaluation in near-OOD is hard – the definition of near-OOD is subjective. However, we think it’s important to evaluate in near-OOD and used synthetic artefacts to make the evaluation possible, as done in other works [1,5]. We tried to mimic a range of artefacts, from those arising in acquisition (missing chunks due to FoV errors, high noise levels) to those arising in data post-processing (flips mimicking DICOM header errors, scaling to mimic intensity normalisation errors). We acknowledge the next step in realism would be to ask a clinician to label a large set of CTs to flag artefacts and evaluate on this; however, this is outside the scope of this study.

References: [1] sciencedirect.com/science/article/pii/S1361841522002298 [2] arxiv.org/abs/2211.07740 [3] arxiv.org/abs/2301.11093 [4] arxiv.org/abs/2106.15282 [5] proceedings.mlr.press/v172/graham22a.html




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Rebuttal has adequately addressed reviewers’ concerns to a large extent. I am not convinced that near OOD should be solved using a subset of 2D slices as near OOD can be manifested by subtle structural abnormalities that might not be captured in 2D slices. Authors are encouraged to incorporate relevant changes and clarifications based on reviewers’ comments in the camera ready.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Thanks for the rebuttal. I am convinced by the clarification of the technical novelty when the 2D diffusion model is extended to 3D. However, I am not yet convinced regarding the performance analysis of the method on more challenging datasets. Currently, in Table 1, the AUC score often reaches 100%. Would this be a relatively easy task for the classifier?



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with most of the answers by the authors, especially in regards to novelty, performance analysis on more challenging tasks, and differences in 2D vs 3D models.

    I believe this paper is of interest to the MICCAI community despite some issues, and suggest that we ask the authors to incorporate the comments raised by the reviewers and the original AC in their final paper.


