
Authors

Matthew Baugh, Jeremy Tan, Johanna P. Müller, Mischa Dombrowski, James Batten, Bernhard Kainz

Abstract

There is a growing interest in single-class modelling and out-of-distribution detection as fully supervised machine learning models cannot reliably identify classes not included in their training. The long tail of infinitely many out-of-distribution classes in real-world scenarios, e.g., for screening, triage, and quality control, means that it is often necessary to train single-class models that represent an expected feature distribution, e.g., from only strictly healthy volunteer data. Conventional supervised machine learning would require the collection of datasets that contain enough samples of all possible diseases in every imaging modality, which is not realistic. Self-supervised learning methods with synthetic anomalies are currently amongst the most promising approaches, alongside generative auto-encoders that analyse the residual reconstruction error. However, all methods suffer from a lack of structured validation, which makes calibration for deployment difficult and dataset-dependent. Our method alleviates this by making use of multiple visually-distinct synthetic anomaly learning tasks for both training and validation. This enables more robust training and generalisation. With our approach we can readily outperform state-of-the-art methods, which we demonstrate on exemplars in brain MRI and chest X-rays. Code is available at https://github.com/matt-baugh/many-tasks-make-light-work.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_16

SharedIt: https://rdcu.be/dnwce

Link to the code repository

https://github.com/matt-baugh/many-tasks-make-light-work

https://github.com/matt-baugh/pytorch-poisson-image-editing

Link to the dataset(s)

https://www.humanconnectome.org/study/hcp-young-adult

https://www.med.upenn.edu/sbia/brats2017/registration.html

https://www.smir.ch/ISLES/Start2015

https://physionet.org/content/vindr-cxr/1.0.0/


Reviews

Review #2

  • Please describe the contribution of the paper

    The paper proposes using synthetic anomalies instead of real anomalous data to solve the anomaly detection task. To generate the synthetic data, the authors apply patch blending, image deformations, and intensity modulations. Identifying each data augmentation can be seen as a separate anomaly detection task for the model to learn/solve. The average performance across the various tasks is used for validation (i.e., to select the best model).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Validating on synthetic tasks is useful as it does not require collecting anomalous data samples.

    The authors show that their method outperforms prior works.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Synthetic anomalies do not exist in the real world. Therefore, I am not sure if the model is actually identifying anomalies of clinical significance (e.g., differences specific to disease pathology). Can the authors elaborate more on how they select the data augmentations and what kind of disease pathology they are trying to simulate, if any?

    This method is not very generalizable, as it relies on specially chosen data augmentations and is mainly applicable to visual anomalies. In comparison, reconstruction-based anomaly detection techniques can be used for more modalities (e.g., images, time-series signals). Moreover, the evaluation is done only on MRI images. Would the proposed data augmentations also work for, say, CT images?

    The authors evaluate different approaches for the brain and chest datasets. For example, in Table 1, NSA is evaluated on the chest dataset but not the brain dataset. Can they clarify the difference in evaluation? They also seem to be missing a comparison with the CutPaste method.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Easy to reproduce. I appreciate that the authors have included their code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Evaluation is done properly and tackles an important problem. However, I am unsure if the method will generalize well to other modalities.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a self-supervised anomaly detection method that uses multiple synthetic tasks in both the training and validation stages of the model. The authors also claim that, by using novel synthetic tasks in validation, the proposed framework can better detect and localize unseen anomalies than other similar SOTA methods. Experiments on chest X-ray and brain MRI datasets show improved performance over the included baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes a simple yet effective framework that can improve any existing architecture’s anomaly detection performance and potentially improve the detection of unseen anomalies. The authors have included a variety of state-of-the-art semi-supervised anomaly detection methods as baselines, which helps in assessing their method. The intuition for why synthetic tasks improve model performance and generalization is explained aptly in the relevant sections of the manuscript.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It is not apparent from the writing in the methods section or the reported results table what backbone architecture is used in “ours”, the proposed method. I would assume the authors have shown that their framework of adding synthetic tasks in the training and validation stages improves performance, but compared to which architecture? Is that architecture the same for the reported “ours” results in Table 1 for the Brain MRI and Chest X-Ray datasets? The title of the paper includes “localization”; however, the paper reports AP/AUROC results, which would be ideal for detection and not localization. I would prefer to see IoU results when an explicit claim of improved localization is made.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors indicate that they will make their code public upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I would recommend the authors highlight improved detection performance in the title of the paper instead of using “localize”, for which results have not been reported.

    I would also recommend explicitly stating which architecture was used for comparison to the other baselines included in Table 1.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have proposed a novel yet simple framework that adds synthetic tasks to the training and validation stages of an anomaly detection model. They have conducted a decent number of experiments to demonstrate its effectiveness, and it is hence a valuable research effort. I would suggest considering the two points mentioned above to make the paper easier to read.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework that uses multiple synthetic tasks to both train and validate a self-supervised anomaly detection model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The presented idea is novel and seems to outperform the baseline methods compared in the manuscript. Code has been provided via an anonymous GitHub repository, making it easy to reproduce the results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The synthetic “tasks” presented here can also be considered augmented versions of the dataset. Based on my understanding, the task remains the same: detection of an anomaly. So I am unable to understand how it is a multi-task framework. Maybe the authors need to explain this more clearly. I agree with the authors’ statement that “this is the first work to train models to directly identify anomalies on tasks that are deformation-based”, but how does it differ from simply augmenting the dataset and then trying to detect the anomaly from that? “Tasks that perform efficient Poisson image blending in 3D volumes” might have been performed for the first time, but extending Poisson image blending from 2D to 3D can hardly be considered a contribution.

    ceVAE, from 2018, has been used as one of the baselines in the manuscript. A more recent paper, “StRegA: Unsupervised anomaly detection in brain MRIs using a compact context-encoding variational autoencoder” by Chatterjee et al., was published in 2022 and outperformed ceVAE. BraTS, also used in the paper under review, was one of the datasets evaluated there, and the StRegA pipeline performed considerably better. The authors should consider adding StRegA or cceVAE, or other recently published methods, as baselines.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code has been provided via an anonymous GitHub repository, making it easy to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Some method-related clarifications are required to improve the readability of the manuscript (as mentioned under the weaknesses point). Recent baselines that seem to outperform the baselines used in this paper should be added. The authors should present a summary of the results (e.g., scores of their method vs. baselines) in the abstract.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is novel and seems to outperform the baselines, but the paper needs improvement to be more acceptable.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes to leverage multiple synthetic tasks to both train and validate a self-supervised anomaly detection model. Experimental results demonstrated clear improvements. All the reviewers confirmed the merits of this paper, namely its interesting idea and improved performance over baseline methods. Other concerns, including further elaboration on implementation details, comparison with existing studies, and the simulation mechanism, should be thoroughly addressed in the final version. Therefore, a decision of Provisional Accept is recommended.




Author Feedback

We thank the reviewers for their kind words and constructive feedback. We sincerely appreciate that they found our method novel and that they saw the importance of anomaly detection as a problem. In the following we address the raised concerns, highlighting changes that we will make in the final version of the paper:

  • R1 “how is it a multi-task framework”: Our use of multi-task refers to training using multiple synthetic tasks, each of which generates a distinct type of anomaly, leading to a different set of anomalous features being learnt. This is in contrast to previous work, which has only experimented with using a single task. We will rephrase the use of the term “multi-task” to avoid this confusion (a toy sketch of model selection by averaging validation performance over such tasks follows this list).
  • R1 “how does it differ from simply augmenting the dataset and then trying to detect the anomaly from that?”: Data augmentation transforms images whilst preserving their semantic content with the aim of encouraging invariance. In contrast, our method actively introduces different types of synthetic anomalies into otherwise normal data. Models are then trained to identify these anomalies, with each type potentially leading to the models learning a different set of features. In this way, the tasks are not extending the existing dataset, as the original data contains no anomalies and so could not be used to directly train an anomaly detection model (a minimal sketch contrasting the two follows this list). We will reiterate the high-level idea of introducing anomalies into healthy data in the first Method paragraph.
  • R2 “Can the authors elaborate more on how they select the data augmentations and what kind of disease pathology they are trying to simulate if any?”: These synthetic anomalies do not aim to simulate specific pathological features, as this would lead to similar issues to constructing a training dataset with a limited set of diseases and using that to train an anomaly detection model. Instead, they aim to train the model to identify a wide variety of subtle, well-integrated anomalies, under the assumption that being able to identify such minor deviations from the normal distribution will enable the model to also identify real-world anomalies. We will specify this in the first Method paragraph.
  • R2 “I am not sure if the model is actually identifying anomalies of clinical significance”: We agree that there is a need to be cautious before assuming that a model trained on synthetic anomalies is able to generalise to real-world data, which is why we chose test datasets that cover multiple pathologies and modalities (brain MRI - 3 pathology types, VinDr - 22 different local labels) to ensure that the model is not learning features specific to just one disease. We will highlight this motivation in the Data section.
  • R2 “the evaluation is done only on MRI images. Would the proposed data augmentations also work for say CT images?”: All evaluation is performed on both brain MRI and chest X-rays. We suspect that our method would also perform well on CT images, as the CT part of the MICCAI Medical Out-of-Distribution Analysis challenge has been consistently won by synthetic-task-based approaches.
  • R3 “what is the backbone architecture used”: We use a U-Net, with the only difference between the brain MRI and chest X-ray experiments being the dimensionality of the convolution layers (3D vs 2D), and the network depth (see supplementary material).
  • R1 “Authors should consider adding StRegA”: We will add StRegA to our comparison in future work; however, a direct comparison cannot currently be made. Although they also test using BraTS, they use different training data (including clinical data in both their train and test sets, where we have a domain gap between research training data and clinical test data) and perform all their evaluation on thresholded outputs (whereas we consider the continuous anomaly maps, to assess their ability to highlight anomalous regions without the additional complication of choosing a classification threshold).
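
To make the contrast between standard augmentation and synthetic-anomaly tasks concrete (see the R1 item above), below is a minimal illustrative sketch in Python/NumPy. The corruption recipe, function names, and parameters are invented for illustration and are not taken from the paper's implementation; the only point is that a synthetic task manufactures its own pixel-wise supervision from purely normal images, whereas augmentation alters an image without changing what it represents.

```python
import numpy as np

def make_synthetic_anomaly_sample(image, source_image, rng):
    """Turn a normal image into a (corrupted image, pixel-wise target) pair.

    Toy patch-blending corruption: a rectangular patch from another normal
    image is alpha-blended into `image`, and the target map records how
    strongly each pixel was altered. A segmentation-style model can then be
    trained to localise the change. Illustrative only, not the paper's code.
    """
    h, w = image.shape
    ph, pw = rng.integers(h // 8, h // 3), rng.integers(w // 8, w // 3)
    y, x = rng.integers(0, h - ph), rng.integers(0, w - pw)
    alpha = rng.uniform(0.2, 0.8)

    corrupted = image.copy()
    patch = source_image[y:y + ph, x:x + pw]
    corrupted[y:y + ph, x:x + pw] = (1 - alpha) * image[y:y + ph, x:x + pw] + alpha * patch

    target = np.zeros_like(image)
    target[y:y + ph, x:x + pw] = alpha  # supervision comes from the corruption itself
    return corrupted, target

def augment(image, rng):
    """Standard augmentation: the image changes but its semantics do not,
    so it provides no anomaly label to train on."""
    return np.flip(image, axis=int(rng.integers(0, 2))).copy()

rng = np.random.default_rng(0)
normal_a, normal_b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
corrupted, target = make_synthetic_anomaly_sample(normal_a, normal_b, rng)
flipped = augment(normal_a, rng)  # still a normal image, just transformed
```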
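The model-selection idea, validating on several visually distinct synthetic tasks and averaging, can be sketched just as briefly. The checkpoint names, task names, and scores below are made up; the only point being illustrated is that the checkpoint with the best mean score across tasks is selected, so no real anomalous data is needed for validation.

```python
import numpy as np

# Hypothetical per-task validation scores (e.g. AP on held-out normal images
# corrupted by each synthetic task) recorded at three checkpoints.
checkpoint_scores = {
    "epoch_10": {"patch_blend": 0.71, "deformation": 0.65, "intensity": 0.80},
    "epoch_20": {"patch_blend": 0.78, "deformation": 0.70, "intensity": 0.77},
    "epoch_30": {"patch_blend": 0.74, "deformation": 0.66, "intensity": 0.82},
}

def select_checkpoint(scores_per_checkpoint):
    """Pick the checkpoint with the highest mean score over all synthetic tasks."""
    return max(scores_per_checkpoint,
               key=lambda ckpt: np.mean(list(scores_per_checkpoint[ckpt].values())))

print(select_checkpoint(checkpoint_scores))  # -> epoch_20
```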


