
Authors

Julia Wolleb, Florentin Bieder, Robin Sandkühler, Philippe C. Cattin

Abstract

In medical applications, weakly supervised anomaly detection methods are of great interest, as only image-level annotations are required for training. Current anomaly detection methods mainly rely on generative adversarial networks or autoencoder models. Those models are often complicated to train or have difficulty preserving fine details in the image. We present a novel weakly supervised anomaly detection method based on denoising diffusion implicit models. We combine the deterministic iterative noising and denoising scheme with classifier guidance for image-to-image translation between diseased and healthy subjects. Our method generates very detailed anomaly maps without the need for a complex training procedure. We evaluate our method on the BRATS2020 dataset for brain tumor detection and the CheXpert dataset for detecting pleural effusions.
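
For readers who want to see the mechanism the abstract describes, below is a minimal PyTorch-style sketch of classifier-guided DDIM image-to-image translation: deterministically noise the input, then denoise it while nudging the noise estimate along the gradient of a classifier toward the healthy class, and take the difference to the input as the anomaly map. All interfaces here (eps_model, classifier, alpha_bar, the step count L, and the guidance scale s) are illustrative assumptions, not the authors' actual code.

```python
import torch

def ddim_anomaly_map(x, eps_model, classifier, alpha_bar, L=500, s=100.0, healthy=0):
    """Classifier-guided DDIM translation to a pseudo-healthy image (sketch).

    Assumed, illustrative interfaces:
      eps_model(x_t, t)  -> predicted noise, same shape as x_t
      classifier(x_t, t) -> class logits of shape (B, num_classes)
      alpha_bar          -> 1-D tensor of cumulative noise-schedule products
    """
    x_t = x
    # Deterministic noising: invert the DDIM sampling update, stepping t -> t+1.
    for t in range(L - 1):
        eps = eps_model(x_t, torch.tensor([t]))
        x0 = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x_t = alpha_bar[t + 1].sqrt() * x0 + (1 - alpha_bar[t + 1]).sqrt() * eps

    # Classifier-guided deterministic denoising toward the "healthy" class.
    for t in range(L - 1, 0, -1):
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            logits = classifier(x_in, torch.tensor([t]))
            log_p = torch.log_softmax(logits, dim=-1)[:, healthy].sum()
            grad = torch.autograd.grad(log_p, x_in)[0]
        eps = eps_model(x_t, torch.tensor([t]))
        # Shift the noise estimate along the classifier gradient (scale s).
        eps = eps - s * (1 - alpha_bar[t]).sqrt() * grad
        x0 = (x_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
        x_t = alpha_bar[t - 1].sqrt() * x0 + (1 - alpha_bar[t - 1]).sqrt() * eps

    # Pixel-wise difference to the translated image serves as the anomaly map.
    return (x - x_t).abs()
```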

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_4

SharedIt: https://rdcu.be/cVRYH

Link to the code repository

https://gitlab.com/cian.unibas.ch/diffusion-anomaly

Link to the dataset(s)

https://www.med.upenn.edu/cbica/brats2020/data.html

https://stanfordmlgroup.github.io/competitions/chexpert/


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel weakly supervised anomaly detection method based on denoising diffusion implicit models. Experiments are conducted on the BRATS2020 dataset for brain tumor detection and the CheXpert dataset for detecting pleural effusions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is mostly well-written and clear.
    2. The method achieves good performance for anomaly segmentation.
    3. Most of the aspects are described in sufficient detail to enable the reproduction of results.
    4. Denoising Diffusion Implicit Models seems a good adaptation to anomaly detection in medical images
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The use of Denoising Diffusion Implicit Models seems not novel enough, given that the authors made few modifications to the previous method [25].
    2. The experiments lack quantitative comparisons.
    3. It would be better if the authors compared with some of the SOTA unsupervised anomaly detection methods and semi/weakly-supervised classification/AD methods. Right now, one cannot judge whether the proposed method is promising or not.
    4. The motivation for why denoising diffusion implicit models work better than traditional generative methods is not clearly discussed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided anonymized code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper is well written and clear to understand. However, I have some concerns about the experiments and the novelty. As discussed above, it would be better if the authors compared with some baselines. Moreover, the novelty can be considered insufficient given that the core technique was proposed in a previous paper [25].

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the technical novelty of the proposed method is not sufficient, and the paper lacks some vital experimental analysis. Hence, I recommend a weak reject for the paper right now.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The paper proposes an anomaly detection method using Denoising Diffusion Implicit Models (DDIM). The key idea is to iteratively add noise to input images and learn to subtract it using DDIM. During the anomaly detection stage, the anomalous image is treated as the noisy image, and the learned network is used to generate a healthy image. Finally, their difference is used to calculate the anomaly map.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key strength of the paper is the idea to use diffusion processes in medical image denoising. In this case, denoising is treated as an image synthesis problem (generating a healthy image), where diffusion models have been shown to preserve more details and generate higher-quality images than GANs, which are traditionally used in this domain. In addition, the experimental validation of the paper is thorough and demonstrates well the capabilities of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper seems to be mainly an application of an existing technique from computer vision to medical imaging.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The model is very clearly explained and the code is included. Evaluation is performed on public datasets. I commend the authors on releasing very clear instructions in the code repository.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Overall, I understood the main ideas of the paper and found it very clear. There were some specific points I did not understand:

    • How is the binary classifier used and why is it necessary?
    • How does the iterative noising process induce specific anatomy details (Section 2, page 4)? It seems like noising would add random, non-anatomical noise.
    • For results in section 3, are the anomalies found indicative of certain clinical findings or annotations?
    • How does the proposed approach compare to non-synthesis-based anomaly detection methods, e.g., density estimation or feature modelling? Please see “Visual Anomaly Detection for Images: A Survey” by Yang et al. It would be helpful to add this to the related-work section and discuss it.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper is mainly an application, it is an important extension of a well-known computer vision technique to medical images. As such, I believe it would be a valuable contribution to the community.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    I have read the other reviews and the author response. While it would be interesting to compare to other anomaly detection methods, reconstruction methods seem closest to the proposed approach; therefore, I think comparing to them and discussing other approaches is reasonable.

    While there is a question of novelty with respect to prior work, this paper presents one of the first applications of diffusion modeling to medical image analysis, and as such, is quite novel and would be valuable to the community.



Review #4

  • Please describe the contribution of the paper

    The paper proposes to use diffusion models trained on healthy patient scans to restore a ‘pseudo-healthy’ scan from scans with anomalies, and to use the residual between the original image and the ‘pseudo-healthy’ image for anomaly localization.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Great and novel idea! (I personally feel this idea was overdue already; I had been thinking about trying it for some years as well.)
    • Experiments on multiple datasets.
    • Nice analysis of the effect of the hyperparameters of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The performance of the method is not entirely convincing (it seems to surpass the presented baseline only in some hyperparameter settings).
    • The VAE baseline appears to have been trained very poorly.
    • Hardly any quantitative comparison.
    • Not sure that BRATS2020 is the best dataset choice for anomaly detection: the prevalence of tumors already gives a different data distribution between normal/healthy and abnormal/diseased slices (even when discarding the lowest and uppermost slices), and it is hard to tell whether a tumor has deformed some unlabeled part of the brain.
    • A discussion of using the “reconstruction” error / residual as the anomaly localization score would be appropriate, since it has received some criticism in recent years (see the Meissen paper).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Looks good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I think some points could give great benefits and more credibility to the paper:

    • Please check your VAE implementation and training; it looks like something went wrong there (put at least as much effort into your baselines as into your own method).
    • Another dataset with a predetermined hyperparameter and training-scheme setting and a realistic evaluation with quantitative numbers would give a better estimate of the performance of the method (e.g., a setting similar to the MOOD challenge).
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, while some aspects of the paper are lacking, I think the idea is interesting and works reasonably well, and as such the paper could be accepted (no one expects a first interesting idea to work perfectly from the get-go).

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    An image anomaly detection method based on denoising diffusion implicit models is presented. The reviewers and I agree that this is an interesting idea and that the paper is written reasonably well. However, I agree with R3 that it seems strange that the paper implies that image reconstruction via GANs and VAEs is all there is for anomaly detection in images. R4 mentions the MOOD challenge, where only a minority of the methods try image-synthesising approaches, and there is an entire community at CVPR doing image anomaly detection through density estimation or the smart design of self-supervision tasks. None of this is mentioned in the paper, and comparisons to these methods (some would call them state-of-the-art because of their excellent performance) are not made. I also agree with R4 that the VAE results don't seem right in the figures, and R2/R4 have a very valid point that no quantitative comparisons with other methods are provided. The source code repository is well made and already available, which adds a big thumbs up from the reviewers and myself. Overall, I would ask the authors to help us with the final decision and provide a rebuttal commenting on, and proposing potential fixes for:

    • the strict focus on reconstruction-based methods and the exclusion of other, effective approaches;
    • the lack of quantitative comparisons to other approaches;
    • the reasoning behind why denoising diffusion implicit models should work better than conventional models.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

First, we want to thank all reviewers for their helpful and valuable comments.

All reviewers and the meta-reviewer point out that we did not compare ourselves to non-synthesis-based anomaly detection methods such as density estimation or feature modeling. We are aware that there is a wide range of methods that we did not discuss, also because the field is very large, as pointed out by Reviewer #4. We decided to compare ourselves to related reconstruction-based approaches and chose two representative methods, i.e., a GAN- and a VAE-based approach. For quantitative comparison to those approaches, we compute the Dice score and the AUROC score. The results are presented in Fig. 4 as horizontal bars. We will also point this out in the text. We agree that non-synthesis-based methods should be discussed in the related work, and we will include the survey paper by Yang et al. The comparison to those approaches is certainly interesting and will be part of future work.
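
For concreteness, the quantitative comparison the rebuttal refers to can be computed along these lines; this is a minimal sketch with assumed inputs (a pixel-wise anomaly map and a binary ground-truth segmentation mask), not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dice_and_auroc(anomaly_map, gt_mask, threshold=0.5):
    """Evaluate a pixel-wise anomaly map against a binary ground-truth mask."""
    pred = (anomaly_map >= threshold).astype(np.uint8)
    gt = gt_mask.astype(np.uint8)
    intersection = (pred & gt).sum()
    dice = 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)
    # AUROC is threshold-free: every pixel's anomaly score is ranked
    # against its ground-truth label.
    auroc = roc_auc_score(gt.ravel(), anomaly_map.ravel())
    return dice, auroc
```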

Reviewer #2 mentions that the paper is not novel, as we adapt the DDIM sampling process. This technique was introduced by Song et al. for sampling synthetic images rather than image-to-image translation. We argue that our paper is the first one to perform image-to-image translation by combining the DDIM sampling process with classifier guidance. The main advantage – in contrast to DDPMs [10] – is that the identity-preserving mechanism is guaranteed by the iterative deterministic noising and denoising scheme. As the stochastic element is removed from the sampling process, it is possible to move backward and forward in the noising and denoising process without losing information. During the denoising process, using the gradient of the binary classifier, only pixels showing an anomaly are changed. The rest of the image, e.g. the background, is preserved. This results in a very accurate anomaly map. Thereby, the advantage of our method over GANs and VAEs becomes apparent. Furthermore, the training of GANs is very complex and unstable: many architectural changes and additional loss terms must be integrated for image-to-image translation to be detail-preserving and cycle-consistent. VAEs are easier to train but often result in blurry images, which renders the difference map inaccurate. Our method presents a straightforward solution to circumvent those issues.
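
The "without losing information" claim can be made concrete with a short sketch: with the stochastic term removed, one DDIM update is invertible, so encoding an image and decoding it again (without any classifier guidance) reproduces the input up to discretization error. As in the earlier sketch, eps_model and alpha_bar are assumed, illustrative interfaces.

```python
import torch

def ddim_step(x_t, eps, a_src, a_dst):
    """One deterministic DDIM update between cumulative noise levels
    a_src and a_dst (no stochastic term, hence invertible)."""
    x0 = (x_t - (1 - a_src).sqrt() * eps) / a_src.sqrt()  # predicted clean image
    return a_dst.sqrt() * x0 + (1 - a_dst).sqrt() * eps

def round_trip(x, eps_model, alpha_bar, L=500):
    x_t = x
    for t in range(L - 1):                     # noising: image -> latent
        eps = eps_model(x_t, torch.tensor([t]))
        x_t = ddim_step(x_t, eps, alpha_bar[t], alpha_bar[t + 1])
    for t in range(L - 1, 0, -1):              # denoising: latent -> image
        eps = eps_model(x_t, torch.tensor([t]))
        x_t = ddim_step(x_t, eps, alpha_bar[t], alpha_bar[t - 1])
    return x_t                                 # approximately equals x
```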

Reviewer #3 further asked whether the results are indicative of clinical annotations. The results on the BRATS dataset show that our anomaly map corresponds to the manual ground truth segmentations, as can be seen by visual comparison, as well as in the Dice and AUROC scores presented in Figure 4. For the CheXpert dataset, we can only compare the images visually, as no ground truth is provided.

Reviewer #4 thinks that the BRATS2020 dataset is not the best choice. We agree that slices not containing a tumor might still be abnormal due to deformation, and that tumor prevalence introduces a bias. However, this dataset is often used as a baseline dataset for medical anomaly detection. We still think that this application is fair as a first step, which can be refined on other datasets later on. As asked by Reviewer #4, we have double-checked our VAE implementation. One issue we observed was that the multi-channel setting for the different BRATS MR sequences was difficult, resulting in a high reconstruction error for T2-weighted images. In other papers [6,29,14], only one sequence/channel was taken into account, which made the reconstruction task easier. As proposed by Reviewer #4, a discussion of the reconstruction error as anomaly score would indeed be interesting and can be added to Section 4. Meissen et al. showed that a simple method based on histogram equalization could outperform neural networks, and they state that reconstruction quality does not correlate well with the Dice score. As an outlook, anomaly scores of other types of methods, e.g., the log-likelihood of density estimation models, can also be included.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper unfortunately lacks a valid evaluation, and in their rebuttal the authors dismiss state-of-the-art methods that may perform significantly better. The only concession made is that they plan to add a reference to a survey paper, which lacks a proper citation in the rebuttal; I, at least, cannot find this work with the information given. Further experiments are also dismissed, e.g., CheXpert does have subsets with ground-truth bounding-box annotations that could be used. The authors also claim that the BRATS dataset is “often used as a baseline dataset for medical anomaly detection”, which is in my opinion not true; I know only a handful of papers that did that in this space. I agree that BRATS is probably not a good choice here. The authors also confirmed that their implementation/use of the VAE was potentially erroneous but do not commit to correcting this problem. Some of these issues are critical, thus I cannot recommend accepting this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main contribution of this work is a new image anomaly detection method based on denoising diffusion implicit models (DDIM). The main idea is to add noise to the images and learn to subtract it using DDIM. For detection, a new image is treated as the noisy image to generate a ‘healthy’ image, and the difference is the anomaly map.

    Key strengths:

    • It is an interesting idea, and novel in the medical imaging field.
    • It seems to improve on relevant state-of-the-art methods.

    Key weaknesses:

    • No quantitative analysis or discussion of other methods (other than GAN and VAE) is provided.
    • Technical novelty (not application) may be considered limited

    Review comments & Scores: R2 & R4 were concerned about the lack of discussion on the motivation to use denoising diffusion implicit models instead of other models such as generative models, and about the lack of a comparative analysis or discussion of similar methods. Reviewers were also concerned about the technical novelty, as the paper “seems to be mainly an application of an existing technique from computer vision to medical imaging” (R3).

    Rebuttal: The authors acknowledged the large number of non-synthesis methods and stated that they chose two representative models (GAN, VAE) because reconstruction-based approaches are more closely related to the proposed work and therefore a fairer comparison.

    Evaluation & Justification: The rebuttal has successfully answered my concerns by explaining the main differences from GAN and VAE methods, and why they chose these two methods.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Relevant and timely topic, weaknesses are well answered. This should be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    upper


