
Authors

Divyanshu Mishra, He Zhao, Pramit Saha, Aris T. Papageorghiou, J. Alison Noble

Abstract

Out-of-distribution (OOD) detection is essential to improve the reliability of machine learning models by detecting samples that do not belong to the training distribution. Detecting OOD samples effectively in certain tasks can pose a challenge because of the substantial heterogeneity within the in-distribution (ID), and the high structural similarity between ID and OOD classes. For instance, when detecting heart views in fetal ultrasound videos there is a high structural similarity between the heart and other anatomies such as the abdomen, and large in-distribution variance as a heart has 5 distinct views and structural variations within each view. To detect OOD samples in this context, the resulting model should generalise to the intra-anatomy variations while rejecting similar OOD samples. In this paper, we introduce dual-conditioned diffusion models (DCDM) where we condition the model on in-distribution class information and latent features of the input image for reconstruction-based OOD detection. This constrains the generative manifold of the model to generate images structurally and semantically similar to those within the in-distribution. The proposed model outperforms reference methods with a 12% improvement in accuracy, 22% higher precision, and an 8% better F1 score.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_21

SharedIt: https://rdcu.be/dnwcy

Link to the code repository

https://github.com/FetalUltrasound/DCDM

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new model for detecting OOD samples in complex scenarios where in-distribution data has multiple classes and high similarity between ID and OOD classes. The authors propose two conditioning mechanisms, IDCC and LIFC, to generate images similar to the input for in-distribution data while detecting OOD samples effectively.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The challenges of OOD in fetal ultrasound are clearly introduced and solved by two conditioning mechanisms IDCC and LIFC, respectively.
    • The experimental results and ablation study demonstrate the effectiveness of the proposed method for OOD detection, outperforming several state-of-the-art methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • While the paper proposes a novel conditioned diffusion model for OOD detection, the motivation for applying this model to OOD is not clearly presented. It would be beneficial to explain the advantages of using a diffusion model over other generative models and how it specifically addresses the challenges of OOD detection. Additionally, the authors claim that the method introduced in previous work, AnoDDPM, is unsuitable for reconstruction-based OOD detection, but do not provide sufficient explanation or comparison with their proposed method. Further elaboration on this point would be helpful in understanding the strengths and limitations of both methods.
    • While the proposed method shows promising results for OOD detection in fetal ultrasound videos, it is important to note that the method is only validated on one dataset comprising 359 subject videos. Further validation on larger and more diverse datasets would be necessary to assess the generalizability and robustness of the proposed method.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides a detailed description of the proposed method and the experimental setup, including the dataset used, evaluation metrics, and implementation details. However, it does not mention the code or any information on how to access the implementation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See Q6.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution of this paper is over-claimed and the motivation is not clear.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The response does not convincingly address my concerns.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a novel diffusion-based method for detecting out-of-distribution samples in fetal ultrasound heart-view videos. The method is a dual-conditioned diffusion model (DCDM), conditioned on in-distribution class information and latent features of the input images. The contribution is that the method improves detection where the in-distribution (ID) samples exhibit complex anatomical structures and look similar to OOD samples.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper provided a good overview of the issues in previous proposed methods for unsupervised OOD detection, including the more recent ones that are based on diffusion models.
    • Proposed new modules to inform the OOD detection model of the appearance and semantics of the ID samples, namely: latent image feature conditioning (LIFC), which determines the shape and texture of a generated image using a pre-trained encoder, and in-distribution class conditioning (IDCC), which determines the classes that should be included in the generated image, using a pre-trained label encoder.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • To provide the f_cls during inference, the authors use a pre-trained CNN classifier to assign classes to OOD samples. In this sense, are OOD samples more a matter of anatomical alteration, or do they bear anomalies that are not part of the healthy anatomy? In either case, the classifier needs to predict for a sample that is not in the same distribution as its training samples, so the predictions may be less reliable.
    • How is the pre-trained encoder obtained? More to this, how do the authors evaluate how capable the encoder is in describing the input images, such that the f_img is rich enough to inform the model of shape/texture information?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors didn’t comment on whether the dataset used in the paper is public or whether the repository will be open-sourced. The authors describe the architecture of the model, but it might need more details to implement it as it is in the paper. The work will only be reproducible if the data and repository are available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • It is a bit unclear how the classes are defined: are they represented as a vector, i.e. a set of labels of the anatomy present in the image, or as segmentation labels?
    • Regarding the calculation of f_0: is it the output of the cross-attention between f_img and f_cls?
    • How does one determine a reasonably good threshold to distinguish ID and OOD samples? Can it be chosen without using OOD samples?
    • Have the authors tried different dimensions for the feature vectors? If so, how does this affect the accuracy?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As a motivation, the authors provide a clear overview of existing methods and describe the issues with them. The paper introduces a diffusion-based model with two novel modules (LIFC, IDCC), conditioning the model on the image feature vector and class feature vector. The proposed modules aim to deal with the inter-class heterogeneity in ID samples as well as to distinguish OOD and ID samples when they look similar. The proposed model is shown to outperform the baseline methods and also improve the detection efficiency of the work by Graham et al., 2022. The concerns are: 1) the authors could discuss the reliability of using the classifier to predict classes on OOD samples and the expressiveness of the pre-trained encoder, as well as some details mentioned in the comment section; 2) reproducibility: will there be an open-sourced repository and dataset to verify the results shown in the paper?

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents a Dual-Conditioned Diffusion Model (DCDM) for out-of-distribution (OOD) detection, applied to fetal ultrasound video datasets. Building on an existing diffusion model, DDPM, the authors introduced two novel conditioning strategies, IDCC and LIFC, to guide the model to generate similar images for in-distribution input and dissimilar images for OOD input, thereby distinguishing anatomies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper presents a novel way of utilizing the diffusion model for out-of-distribution detection, which resolves the problems of the previous OOD detection methods (GAN-based methods are challenging to train, and likelihood-based methods assign high likelihood to OOD samples). The quantitative results confirm this point.
    2. The proposed conditioning methods IDCC and LIFC demonstrate great novelty and performance. The qualitative results are explainable and justify the use of both modules.
    3. Abundance of training and evaluation data.
    4. Well-written paper with a clear logical flow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors used different datasets for training and evaluation; therefore, domain shift might be a concern.
    2. The authors claimed that their conditioning modules IDCC and LIFC should outperform previous conditioning methods, such as a simple concatenation, but this was not validated with any results. Also, there are other types of conditioning methods, such as class conditioning via adaptive normalization (refer to original DDPM paper), but no comparison was shown.
    3. Some details are missing (see below).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    1. The authors did not discuss their code’s availability in the paper.
    2. It seems that they used a private dataset. They mentioned the size and the subjects, but the acquisition details are not provided.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Please discuss why you used different datasets for training and evaluation, and whether the domain difference is a concern.
    2. Briefly describe why distinguishing different anatomies is clinically important.
    3. The authors used cross-attention for IDCC and LIFC. However, the details are missing. A block diagram is essential.
    4. The details of the in-distribution classifier are missing. What is the performance of this classifier? Does an incorrect classification during testing negatively impact the result?
    5. Rationale for the use of beta noise scheduling for the diffusion process: normally a linear or a cosine schedule is used. At least provide a citation.
    6. The choice of the classification threshold τ is unclear. How was 0.73 chosen? From Fig. 2, it seems that DCDM gives high confidence for the OOD image (0.68), which is pretty close to the ID image (0.77).
    7. The F-1 scores in Table 2 are missing.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has good novelty. However, some experimental details are missing. The use of different datasets for training and evaluation is also a concern.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    There are still some concerns that have not been addressed:

    1. The designs of IDCC and LIFC are not justified.
    2. From the rebuttal, it seems the choice of threshold τ is based on the best F1 score on the test dataset. This overfits the test dataset, which is unreasonable.
    3. Some of my review comments were not responded to in the rebuttal, for example, the details of cross-attention, the beta noise schedule, and the high confidence for the OOD image (0.68). For this reason, I keep my original score.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a new out-of-distribution (OOD) detection method. The claim is that this method is especially suited to applications with “substantial heterogeneity within the in-distribution (ID), and the high structural similarity between ID and OOD classes.” The proposed solution is a diffusion model that is “conditioned on in-distribution class information and latent features of the input image for reconstruction-based OOD detection”.

    The proposed solution is not well justified. It is not clear why these dual conditioning tricks should improve OOD detection accuracy under this scenario. Moreover, the assumed scenario does not make much sense and seems to involve a contradiction. How can you have “substantial heterogeneity within in-distribution” and “high similarity between ID and OOD”? How do you define ID and OOD then? In the specific example given, cardiac images are ID and head/abdomen/femur images are OOD, which does not satisfy these conditions.

    Methodologically, it is not clear why the proposed method should work. On page 2 the authors write: “IDCC is proposed to handle high inter-class variance within in-distribution classes and high spatial similarity between ID and OOD classes. LIFC is introduced to counter the intra-class variance within each class.” This is not clear to me. Why should these two (IDCC and LIFC) work in the way described? The input to section (c) (Figure 1) is the same as (a). If f_img and f_cls are useful features for OOD detection, the main model in (a) should be able to learn them automatically. What is special about part (c)? On top of this, the encoder (E) in (c) is fixed and not trained at all. Why should it help then? The x_t in (a) is already conditioned on the image. Reviewers 1 and 3 agree with me, while Reviewer 2 raises important points about the method that need clarification.

    Other weaknesses:

    • Compared methods are not conclusive. There are various types of OOD detection methods, such as those based on the internal representation of the network.
    • Reviewer 3 points out that the clinical importance of the application is unclear.
    • “Given an in-distribution dataset comprising n heterogeneous classes, conditioning the model only on image-level features is insufficient.” Why not? Your label comes as the output of a CNN (at least at test time); the encoder in (c) (Figure 1) should be able to capture the information in the image label as well.
    • Why should ALOCC generate images that are similar to OOD? It has not seen any such data during training.
    • Table 2 does not have statistical test values.
    • Results are limited by the fact that only one data type (fetal US) has been used. Reviewer 1 agrees.




Author Feedback

  1. Clinical & Diffusion-based OOD Motivation (M1, R3, R1): Clinical: A routine prenatal ultrasound video typically comprises 13 anatomies and their views. However, analysis models are usually developed for anatomy-specific tasks (e.g., the fetal heart). To separate heart views in US videos and use them for detecting congenital heart diseases, we need an OOD detection system as proposed here. Diffusion: Diffusion models generate sharp and detailed features, whereas GANs suffer from mode collapse. This can lead to an increase in reconstruction errors for ID samples (5 heart views in our case), making them harder to distinguish from OOD samples.
  2. Dataset details (M1, R1, R2, R3): All the data (ID & OOD) come from the same private dataset, detailed in protocol paper XX, and are separated into training (ID) and test (ID+OOD) sets respectively. The whole dataset contains 359 videos, comprising 5000 frames for training and 7471 frames for testing (Supp. Fig. 2). Our model is a frame-based approach, and the amount of data is considered abundant by R3. In this work, we focus on the application of fetal heart (ID) detection in ultrasound data and hence only test on fetal ultrasound data.
  3. Substantial ID heterogeneity & high similarity between ID & OOD (M1): The ID consists of five fetal heart views where there are significant local variations while the OOD classes, such as fetal abdomen, head, can share a significant spatial resemblance globally with these different heart views. Hence, our data exhibits both ID heterogeneity and high spatial similarity.
  4. Autoencoder details (M1, R2): The Autoencoder (AE = E+D) is pretrained separately on ID heart data and can successfully reconstruct input heart images (SSIM = 0.956) which verifies that features extracted by E are rich. We will include AE performance details in Supp. materials.
  5. Why LIFC, IDCC are needed (M1): In Fig 1(a), the input image(x_0) spatial dimension is reduced (224->64) by the encoder(E) and then highly noised(t=1000) by the forward diffusion process to yield z_t. This leads to the spatial detail loss. Only feeding z_t (ideally Gaussian noise) to the main model generates any arbitrary image from the ID. We require IDCC and LIFC to condition the model to generate images having the same heart class and spatial structure as the input.
  6. LIFC is insufficient (M1): E used to generate LIFC is trained to focus on image reconstruction and considers spatial details of the input images (shape and boundaries). Thus, the high-level class-specific information is not well considered by LIFC. IDCC encodes class information via the label encoder. This provides the heart view information to the diffusion model. LIFC alone will result in a generated image that spatially resembles the input but without features specific to the correct heart view (Fig. 3, Supp. Fig. 1).
  7. Comparison Methods (M1, R1): We compare our model with GAN-based [22], likelihood-based [14] and diffusion-based [9] methods. These methods utilize internal representations to detect OOD. Moreover, our comparison methods outperform internal representation methods like AutoEncoder-Mahalanobis, MemAE, etc. AnoDDPM [28] is similar to [9], but weakly noises an input image (t=250 rather than t=1000). This results in worse performance because most spatial details of the input image are visible and hence the diffusion model can perfectly regenerate the input for both ID and OOD cases. Additionally, [9] outperforms [28] in their experiments, thus prompting us to compare with [9] instead.
  8. Reproducibility (M1, R1, R2, R3): Code is released at https://github.com/FetalUltrasound/DCDM
  9. Classifier details (R2, R3): The classifier is designed to handle the heterogeneity within ID samples (88% accuracy). For OOD samples, as it is not trained on them, it predicts a heart view and thus leads to the generation of an image different from the input.
  10. Threshold (R2, R3): τ is calculated by finding the threshold corresponding to the best F1-score.
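The threshold-selection step described in the Threshold item above can be sketched as follows. This is an illustrative reconstruction only, not the authors' released implementation: the function names (`best_f1_threshold`, `f1_score`) and the convention that higher reconstruction-similarity scores indicate ID samples are our assumptions.

```python
# Hypothetical sketch: pick the decision threshold tau that maximizes the
# F1 score of ID detection, given per-image similarity scores
# (higher = more ID-like) and binary labels (1 = ID, 0 = OOD).

def f1_score(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(scores, labels):
    """Try each observed score as a candidate cut-off and return the
    (threshold, F1) pair with the highest F1."""
    best_tau, best_f1 = 0.0, -1.0
    for tau in sorted(set(scores)):
        preds = [1 if s >= tau else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```

As R3 notes, performing this sweep on the test set risks overfitting to the test data; in practice the threshold should be chosen on a held-out validation set containing both ID and OOD samples.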




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have responded to some of the technical questions. However, my main concern remains: “The proposed solution is not well justified. It is not clear why these dual conditioning tricks should improve OOD detection accuracy under this scenario.” Given the reviewers’ recommendation and the ranking of the paper in my pile, I accept the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal is not convincing, in my opinion. The proposed solution is not well justified, and it is not clear why the proposed method should work. Why should dual conditioning of a diffusion model improve OOD detection under this scenario? More importantly, the assumed scenario does not seem realistic, and the clinical significance/usefulness of this work is questionable.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers have recognized the strengths of the work, but they have also identified several areas that need improvement. Particularly, there are concerns about the clarity and justification of the motivation and clinical significance of applying this model to out-of-distribution (OOD) scenarios, which have not been adequately addressed even in the rebuttal. Furthermore, after reviewing the rebuttal, there are still lingering concerns regarding the design of IDCC and LIFC, the choice of threshold, and insufficient method details, among others. In light of these issues, I recommend that the authors incorporate the recommendations from the reviewers to enhance future versions of the paper.


