
Authors

Othmane Laousy, Alexandre Araujo, Guillaume Chassagnon, Nikos Paragios, Marie-Pierre Revel, Maria Vakalopoulou

Abstract

In medical imaging, segmentation models have known a significant improvement in the past decade and are now used daily in clinical practice. However, similar to classification models, segmentation models are affected by adversarial attacks. In a safety-critical field like healthcare, certifying model predictions is of the utmost importance. Randomized smoothing has been introduced lately and provides a framework to certify models and obtain theoretical guarantees. In this paper, we present for the first time a certified segmentation baseline for medical imaging based on randomized smoothing and diffusion models. Our results show that leveraging the power of denoising diffusion probabilistic models helps us overcome the limits of randomized smoothing. We conduct extensive experiments on five public datasets of chest X-rays, skin lesions, and colonoscopies, and empirically show that we are able to maintain high certified Dice scores even for highly perturbed images. Our work represents the first attempt to certify medical image segmentation models, and we aspire for it to set a foundation for future benchmarks in this crucial and largely uncharted area.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_58

SharedIt: https://rdcu.be/dnwD6

Link to the code repository

https://github.com/othmanela/medical_cert_seg

Link to the dataset(s)

http://db.jsrt.or.jp/eng.php

https://lhncbc.nlm.nih.gov/LHC-downloads/downloads.html#tuberculosis-image-data-sets

https://challenge.isic-archive.com/data/#2018

https://polyp.grand-challenge.org/CVCClinicDB/


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors introduce the first work on certifying deep learning models for medical image segmentation. The proposed method is based on diffusion probabilistic models (DPMs), allowing the authors to use off-the-shelf models. Experiments on 5 medical image segmentation datasets and 3 model architectures show that the proposed method is able to maintain higher segmentation performance (Dice) than the competing method, SegCertify, even in the presence of a large perturbation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is the first work towards certifying medical image segmentation models for robustness to perturbations. The paper is well-written for the most part, with a good overview of the literature.

    2. The method has been explained fairly well and is easy to follow.

    3. Extensive experiments on multiple architectures, 5 datasets, and multiple noise distributions show the efficacy of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The formulation in Section 3 says Y = {0, …, k} represents the k-classes. Assuming the background is also a class, there are (k+1) classes from 0 to k. Please fix this.

    2. The datasets, except for ISIC 2018, are small, making the corresponding test partitions quite small. 20% of JSRT and Montgomery are 49 and 28 images respectively, and for Shenzhen and CVC-ClinicDB, the test partitions contain ~130 images each. Given the small test sets, it is highly recommended that the authors either repeat the training and evaluation for multiple random seeds or perform statistical significance tests for reported performance differences, neither of which is present in the paper.

    3. While this paper might be the first to certify segmentation models for medical images, the comparison to other related certification methods is lacking, with SegCertify being the only competing method compared against. The authors should preferably compare their results against other relevant methods.

    4. In Figure 1, what do the white pixels denote? Take the skin lesion image for example. It is a binary segmentation task - lesions (brown) versus background (black). Please specify what the white pixels mean.
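The statistical significance test requested in weakness 2 could, for example, take the form of a paired permutation test on per-image Dice scores. The sketch below is only an illustration under stated assumptions: the function name `paired_permutation_test`, the number of permutations, and the score lists are hypothetical and do not come from the paper or its code.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired sign-flip permutation test.

    scores_a, scores_b: per-image scores (e.g. Dice) of two methods
    evaluated on the same test images, in the same order.
    Returns an estimated p-value for the null hypothesis that the
    two methods perform equally on average.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A paired test fits this setting because both methods are scored on the same images, and it makes no normality assumption, which matters for test sets of only a few dozen images.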

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this paper is low. In the reproducibility checklist, the authors have replied “Yes” to all the items, but almost all of them are missing from the paper or the supplementary material.

    1. “A clear declaration of what software framework and version you used.” -> missing.
    2. “The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results.” -> missing.
    3. “Information on sensitivity regarding parameter changes.” -> missing.
    4. “The exact number of training and evaluation runs.” -> missing.
    5. “A description of results with central tendency (e.g. mean) & variation (e.g. error bars).” -> missing.
    6. “An analysis of statistical significance of reported differences in performance between methods.” -> missing.
    7. “The average runtime for each result, or estimated energy cost.” -> missing.
    8. “A description of the memory footprint.” -> missing.
    9. “An analysis of situations in which the method failed.” -> missing.
    10. “A description of the computing infrastructure used (hardware and software).” -> missing.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The reference for ISIC 2018 is incorrect. Please cite the correct references as listed here: https://challenge.isic-archive.com/data/#2018.

    2. Please format the best values in Table 1 and 3 using a bold font for improved readability.

    3. For the percentage of abstentions, please specify either in the text or in the tables that the lower is better.

    4. In Section 4 on page 5, the sentence “Since randomized smoothing is applied to each pixel separately …” is not quite clear.

    5. In Section 5 on page 8, the sentence “We note that the single-step …, while it is faster” should be rephrased to state what the “it” is referring to.

    6. In Section 3 on page 3, please specify what $p_y$ denotes. Although it can be guessed what it stands for, it might be better to explicitly specify it for clarity.

    7. In Section 5 on page 7, the authors write “The main drawback however is that its Dice on unpertubed images drops significantly …”. Since no statistical significance tests have been performed, I would suggest the authors rephrase this to say something like “drops considerably”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-written paper with a meaningful and well-defined contribution. Although the method appears to be sound and the evaluation has been carried out on multiple architectures and datasets with good results, I have raised concerns about the reproducibility of the experiments, specifically the small size of the test partitions of the datasets and the lack of repeated experiments or statistical significance tests to account for the small test sets.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The paper presents the first certified segmentation baseline for medical imaging, which uses randomized smoothing and diffusion models to certify models and obtain theoretical guarantees. The authors conducted extensive experiments on five public datasets of Chest X-Rays, skin lesions, and colonoscopies, and empirically showed that their approach can maintain high certified Dice scores even for highly perturbed images. Their technique leverages off-the-shelf denoising and segmentation models and provides the highest certified Dice and IoU on multi-class and binary segmentation of five different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper leverages randomized smoothing and diffusion probabilistic models to achieve state-of-the-art results on certified segmentation for medical imaging.

    • The paper proposes a comprehensive study on certified segmentation for medical imaging, which to the best of the authors’ knowledge, has not been done before.

    • The paper uses a denoising diffusion probabilistic model (DPM) that is inherently iterative, allowing for the use of randomized smoothing.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors’ contributions are not clearly stated in the introduction.

    • There is no ablation study for the proposed method.

    • The proposed method involves training a denoising diffusion probabilistic model, which is an iterative process that can be computationally expensive.

    • Complexity: The proposed method involves combining randomized smoothing with a diffusion probabilistic model, which may make the method more complex and difficult to understand for some readers.

    • While the paper discusses some of the limitations of randomized smoothing and diffusion probabilistic models, it could benefit from a more comprehensive discussion of the limitations of the proposed method.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors employed public datasets and the paper is easy to follow and understand, so it seems reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper proposes a method for certified segmentation in medical imaging using randomized smoothing and diffusion probabilistic models. The method involves training a denoising diffusion probabilistic model and using it in conjunction with randomized smoothing to obtain state-of-the-art results in certified segmentation. The paper also presents a comprehensive study on certified segmentation for medical imaging. The authors employed public datasets and the paper is easy to follow and understand, so it seems reproducible. The authors’ contributions are not clearly stated in the introduction. There is no ablation study for the proposed method. The proposed method involves training a denoising diffusion probabilistic model, which is an iterative process that can be computationally expensive. The proposed method involves combining randomized smoothing with a diffusion probabilistic model, which may make the method more complex and difficult to understand for some readers. While the paper discusses some of the limitations of randomized smoothing and diffusion probabilistic models, it could benefit from a more comprehensive discussion of the limitations of the proposed method. Overall, the approach seems interesting and novel.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical novelty, reproducibility, and results achieved.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a certified segmentation baseline for medical imaging that employs randomized smoothing and diffusion models. The authors utilize diffusion models to remove noise from perturbed images and then segment the denoised image. They showcase the relationship between the certified radius and the model’s performance under varying noise scales. In contrast to a recent certification method for natural image segmentation, the authors’ approach is more resilient to elevated noise levels, offering a larger certification radius. Furthermore, it can employ readily available denoising and segmentation models without necessitating significant fine-tuning.
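The noise, denoise, then segment loop summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `denoise` and `segment` are hypothetical callables standing in for the off-the-shelf diffusion denoiser and the segmentation network, the image is a flat list of intensities, and the default sample count is arbitrary.

```python
import random


def collect_votes(image, denoise, segment, sigma=0.25, n=100, seed=0):
    """Collect per-pixel class votes over n Gaussian-perturbed copies.

    image:   flat list of pixel intensities.
    denoise: hypothetical callable mapping a noisy image to a clean one.
    segment: hypothetical callable mapping an image to per-pixel labels.
    Returns votes, where votes[j] lists the labels predicted for pixel j.
    """
    rng = random.Random(seed)
    votes = [[] for _ in image]
    for _ in range(n):
        # Perturb every pixel with isotropic Gaussian noise of scale sigma.
        noisy = [x + rng.gauss(0.0, sigma) for x in image]
        # Denoise first, then segment the denoised image.
        labels = segment(denoise(noisy))
        for j, label in enumerate(labels):
            votes[j].append(label)
    return votes
```

The per-pixel vote lists are what a randomized-smoothing certification step would consume; the better the denoiser restores the image, the more consistent the votes remain at large noise scales.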

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The paper presents a novel and significant contribution by introducing the first certification method for medical image segmentation. This has crucial implications for ensuring the robustness of clinical practice and providing a systematic approach to validate segmentation models, which is imperative for the accuracy and reliability of medical diagnosis and treatment.

    (2) The evaluation of the proposed method is comprehensive and convincing. The authors validate their certification approach on five publicly available datasets and compare it to a certification method for natural image segmentation. Their method exhibits robustness to noise, enabling much larger certification radii, while maintaining the original segmentation performance to a considerable extent. Additionally, the qualitative results presented in Figure 1 demonstrate that the method refrains from making predictions in highly uncertain areas, typically around the edges of the object of interest, ensuring higher precision and accuracy of the segmentation model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Method presentation: The paper’s presentation of ideas can be improved to make it more accessible to readers outside the certification domain. The description of the certification process, presented as multiple equations in free text on pages 4-5, can be challenging to follow. It would be helpful if the equations were in separate equation blocks and supplemented with an overview figure to assist readers. For instance, the term “R” (certified radius) introduced on page 3 reoccurs in Table 1 on page 5, requiring the reader to search for its definition in the text, which is time-consuming. Including a notation table with all equations in the supplementary or having separate equation blocks could enhance readability.

    (2) Legibility and formatting: The paper’s legibility and formatting could be improved in a few ways. Table 1 is hard to read, with closely packed numbers that are difficult to differentiate. Equation 1 does not have an index and is difficult to reference. Additionally, Table 1 lacks citations of the datasets and methods/architectures used. To support the claims in the text, it would be helpful to bold the best scores in Table 1, such as ResUNet++ being the most robust model.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is very good. The authors only use public datasets and they will release their code. It is also possible to follow the descriptions in the paper to reproduce certain parts of their method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Following the listed weaknesses of the paper, here are my detailed comments:

    (1) The ideas are presented in plain text, including equations and the entire workflow of the authors’ method. I would suggest considering adding one holistic figure, which encapsulates the entire pipeline of the method if the space allows it, or even consider adding it to the supplementary. The figure is optional; however, having equations with indices is quite important, not only for following the paper but also when citing it. The authors should revise Section 4 during the rebuttal to fit the formatting criteria.

    (2) To improve the legibility in Table 1, the authors could simply add vertical lines between the anatomical structures, or highlight the best scores for each sigma to show that ResUNet++ is the most robust model, similar to Table 2.

    Small comments:

    • Please include a citation to back up this sentence in the introduction (“… for medical diagnosis, screening, and prognosis”).
    • Typo in Caption of Figure 1: there should be a comma before “and chest X-Ray”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a novel idea and has a strong evaluation, the presentation of the method and the legibility of the paper require a revision. Hence, I would opt for a weak accept with the condition that the authors revise the manuscript to conform to the formatting guidelines (see https://resource-cms.springernature.com/springer-cms/rest/v1/content/19242230/data/v11)

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is novel and presents one of the first works toward certifying medical image segmentation models for robustness to perturbations. The evaluation is comprehensive and convincing and the paper is well-written. Please consider taking care of the questions/points of the reviewers for the camera-ready version.




Author Feedback

We greatly appreciate the feedback received from all reviewers as well as the meta-reviewer. The reviewers have noted the comprehensiveness, as well as the novelty of the approach, but we would like to address some of the comments regarding the weaknesses of our work.

An ablation study for the denoiser was presented in the supplementary material. We tested our approach against a UNet denoiser that was explicitly trained for the denoising task on each noise level (0.25, 0.5, and 1). A quantitative comparison is presented in Table 3 of the supplementary material. We note that even with custom-trained denoisers, the Denoising Diffusion Probabilistic Model (DPM) achieves the best performance. Qualitative results are also presented in Figure 1 of the supplementary material and we notice that the DPM is able to keep high-fidelity images compared to the UNet, especially on higher noise levels. Also, it is important to mention that these results are achieved with a DPM that is not trained specifically for medical image denoising and is compared to a UNet architecture explicitly trained on chest X-rays.

While we agree with R1 that adding a DPM to a randomized smoothing approach may make the method more complex, we believe that the gain in performance outweighs the downsides. In fact, we use a DPM off-the-shelf and our pipeline does not require the training of a new one, since we empirically showed that a DPM trained on ImageNet generalizes well to the medical image modalities we explored in this paper.

An evaluation on random seeds of our models (R2) was not performed since our evaluation of the certification for each image ensures statistical guarantees. In fact, for each image, we sample 110 times from the segmentation model and perform multiple testing correction. Regarding our comparison (R2), SegCertify is the only exhaustive work that was found to tackle certified segmentation with randomized smoothing. Given that our code will be publicly available, we hope that it serves the community by encouraging further advancements in the certification of segmentation models in medical imaging.

Regarding the limitations of our approach (R1), we have shown that certified Dice and IoU scores for smaller structures (e.g., clavicles on chest X-ray) drop more as the noise increases. Future work will involve testing our pipeline on more datasets with small structure segmentation.

Thank you for the detailed feedback. We will be fixing all typos, citations, and formatting in the camera-ready version.
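The per-image statistical guarantee mentioned in the feedback (repeated sampling followed by multiple testing correction) can be illustrated with a generic randomized-smoothing recipe. The sketch below is an assumption-laden illustration and not the authors' code: the threshold `tau`, the significance level `alpha`, the Bonferroni correction, and the function names are all choices made here for the example. It also reuses the same samples to pick and to test the top class, which is a simplification; a rigorous procedure would draw separate samples for the selection step.

```python
from collections import Counter
from math import comb
from statistics import NormalDist

ABSTAIN = -1  # label used for pixels that cannot be certified


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def certify_pixels(votes, tau=0.75, alpha=0.001, sigma=0.25):
    """Certify each pixel from its votes over noisy samples.

    votes[j]: class labels predicted for pixel j over n noisy copies.
    A pixel keeps its majority class only if a one-sided binomial test
    rejects "top-class probability <= tau" at the Bonferroni-corrected
    level alpha / number_of_pixels; otherwise the pixel abstains.
    Returns (labels, radius), with radius = sigma * Phi^{-1}(tau).
    """
    m = len(votes)
    labels = []
    for pixel_votes in votes:
        top_class, count = Counter(pixel_votes).most_common(1)[0]
        p_value = binom_tail(count, len(pixel_votes), tau)
        labels.append(top_class if p_value <= alpha / m else ABSTAIN)
    radius = sigma * NormalDist().inv_cdf(tau)
    return labels, radius
```

With 110 votes per pixel, a pixel whose votes are unanimous is certified, while a pixel split 60/50 abstains. Abstentions of this kind tend to concentrate near object boundaries, where predictions flip under noise.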


