
Authors

G. Jignesh Chowdary, Zhaozheng Yin

Abstract

Diffusion models have shown their power on various generation tasks. When applying a diffusion model to medical image segmentation, there are a few roadblocks to remove: the semantic features required for conditioning the diffusion process are not well aligned with the noise embedding, and the U-Net backbone employed in these diffusion models is not sensitive to the contextual information that is essential during the reverse diffusion process for accurate pixel-level segmentation. To overcome these limitations, we present a cross-attention module to enhance the conditioning from source images, and a transformer-based U-Net with multi-sized windows for extracting contextual information at various scales. Evaluated on five benchmark datasets with different imaging modalities, including Kvasir-Seg, CVC Clinic DB, ISIC 2017, ISIC 2018, and Refuge, our diffusion transformer U-Net achieves great generalization ability and outperforms all the state-of-the-art models on these datasets.


Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_59

SharedIt: https://rdcu.be/dnwD7

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces two main techniques to improve the diffusion-based segmentation model: (1) a cross-attention module that joins the image feature embedding and the noise feature embedding; (2) a multi-sized module to improve the model capacity. Experiments show that the final model significantly surpasses the selected prior works in terms of segmentation metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper successfully combines several advanced techniques, including diffusion, transformer and cross attention, achieving impressive performance over prior SOTA results.
    2. The multi-sized module is simple yet effective.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The comparisons between this work and prior art that does not use diffusion methods are not fair. As stated in Sec. 3.2, “during inference an average ensemble of 25 predictions is considered as the final prediction”. I believe that if the previous non-diffusion-based methods were also ensembled 25 times, their metrics would become significantly better as well.
    2. Lack of analysis of the model complexity and computation. The proposed method introduces many components and additional computation over the baseline. It is necessary to also compare the model size and FLOPs with prior works.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper should be easy to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Compare this work with other diffusion-based methods in terms of segmentation performance;
    2. Provide an analysis and comparison of the model size and computational complexity;
    3. It would be better if the authors compared the multi-sized design with the inception module in CNNs, both of which share similar ideas and structure; e.g. the Inception-v4 paper, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning” (AAAI’17);
    4. In Table 3, the “DC” and “IoU” columns should be switched.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a diffusion-based segmentation model that significantly surpasses previous results. Experiments show that all the proposed modules contribute considerably to the final performance. Despite the impressive performance, the comparisons with prior works are not thorough and can be further improved, especially with respect to efficiency.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a Diffusion Transformer U-Net for medical image segmentation based on the diffusion model. The authors introduce a cross-attention module into the diffusion model to enhance the information obtained from the original image. They also propose a transformer-based U-Net, which uses multi-sized windows and linear self-attention to extract multi-scale features. Finally, the authors demonstrate the effectiveness of their method on multiple datasets.
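The cross-attention fusion summarized above can be illustrated with a minimal sketch. This is generic scaled dot-product cross-attention, not the authors' module: the choice of the noise embedding as queries and the image embedding as keys and values is an assumption, learned projection matrices are omitted, and the names `cross_attention`, `noise_embedding`, and `image_embedding` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Toy embeddings (assumed shapes): 3 noise tokens and 2 image tokens, dimension 2.
noise_embedding = [[0.1, 0.4], [0.9, -0.2], [0.0, 0.0]]
image_embedding = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: queries come from the noisy label, keys/values from the image.
fused = cross_attention(noise_embedding, image_embedding, image_embedding)
```

Each fused token is a weighted average of image tokens, so the output keeps one token per query while carrying image content.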

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The CA module was proposed to integrate the information of the original image, which is innovative and beneficial for subsequent segmentation.
    2. The Transformer was used to improve UNet by introducing Linear Self-Attention, which reduced the computational complexity.
    3. Sufficient experiments were conducted on multiple datasets to demonstrate the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. No significant analysis was conducted on the saliency of the proposed method.
    2. The parameters and computational complexity of the compared models were not compared.
    3. The required computational resources for training were not described.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the proposed method is uncertain due to the lack of code provided by the authors in the supplementary materials.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. It is recommended to conduct a saliency analysis of the experiments to make the results more convincing.
    2. It is recommended to compare the computational and parameter complexity among the compared models.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The experiments are relatively sufficient.
    2. The method description is clear.
  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a novel segmentation framework based on UNet, leveraging the denoising diffusion probabilistic model (DDPM) and transformers. During training, the diffusion (forward) process gradually adds random Gaussian noise to the ground-truth labels over 1000 time steps; in the reverse process, at each timestep, an encoder with two residual-inception blocks extracts feature embeddings from the image and the noisy label. The embeddings are then fused by a cross-attention (CA) module to improve the conditioning of the diffusion model, and the fused embedding is fed into a transformer-based UNet with multiple window sizes (Multisize-transformer), which denoises the noisy label and outputs the segmentation mask.

    The authors conducted extensive experiments. There are great improvements over baseline models and moderate improvements over previously reported results.
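The forward process described in this summary can be sketched with the standard DDPM closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. Only the 1000 time steps come from the review; the linear beta schedule and its endpoints (1e-4 to 0.02) are assumptions borrowed from the common DDPM default, and the helper names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the paper's schedule may differ."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s up to t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars

def q_sample(mask, t, alpha_bars, rng):
    """Noise a ground-truth mask directly to step t (0-indexed)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row]
            for row in mask]

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

mask = [[0.0, 1.0], [1.0, 0.0]]                 # toy 2x2 ground-truth label
noisy_early = q_sample(mask, 0, alpha_bars, rng)    # almost the clean mask
noisy_late = q_sample(mask, 999, alpha_bars, rng)   # almost pure Gaussian noise
```

Under this schedule ᾱ_t shrinks from nearly 1 to nearly 0, which is why the last step carries almost no label signal and the reverse process must rely on the image conditioning.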

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Proposed a novel perspective to combine the segmentation tasks with DDPM.
    2. Used a cross-attention module to fuse the features from the image and the noisy label during the reverse process for better conditioning of the diffusion model.
    3. Proposed an MT U-Net, improving over UNet and the vanilla-transformer-based UNet.
    4. Extensive ablation studies effectively justify the contribution of each component.
    5. Extendability: Results in supplementary materials suggest that the proposed diffusion process (DDPM with CA) can be employed as a plug-in to enhance various segmentation backbones.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although I appreciate the authors’ effort in conducting very extensive experiments, I found some inconsistent results: the results in Table 3 seem to be mixed up and do not match those in Tables 1/2. The DC and IoU columns seem swapped: DC in Tables 1/2 matches IoU in Table 3.
    2. Improvement over SOTA results is limited compared to improvement over baseline models.
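A quick way to check for swapped DC and IoU columns is the identity DC = 2·IoU / (1 + IoU), which holds for a single pair of binary masks and implies DC ≥ IoU. The sketch below is illustrative only: `dice_and_iou` is a hypothetical helper, the masks are toy data, and for scores averaged over a dataset only the inequality is guaranteed, not the exact identity.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area, target_area = sum(pred), sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treat as perfect agreement (a common convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou

pred = [1, 1, 1, 0]
target = [0, 1, 1, 1]
dice, iou = dice_and_iou(pred, target)

# Per-mask identity: Dice = 2 * IoU / (1 + IoU), hence Dice >= IoU.
dice_from_iou = 2.0 * iou / (1.0 + iou)
```

A table row whose DC value is lower than its IoU value therefore signals mislabeled columns.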
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be released as claimed in the manuscript. Datasets are all public. The work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Future directions: the authors only conducted experiments on RGB medical images and did not investigate gray-scale medical images, such as MRI and CT, which are important components of medical image computing. It would be great if the authors could add some results on such datasets, for example organ segmentation (AMOS 2022 dataset).
    2. It would be more helpful if the authors could elaborate on the loss function (equation 2).
    3. Please correct any errors/typos in the table as pointed out in the section of weaknesses.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clearly presented. A novel segmentation framework was proposed leveraging UNet, transformer, and DDPM. Given the great performance and extendability, I would recommend a strong acceptance of this paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a novel segmentation model that combines UNet, denoising diffusion probabilistic model (DDPM), and transformers. It introduces a cross-attention module to enhance information from the original image and a multi-sized module for extracting multi-scale features. The experiments demonstrate significant improvements over previous methods. Reviewers appreciate the novel techniques and extensive evaluations but suggest improvements in comparisons, analysis of model complexity, and providing code for reproducibility. Despite minor weaknesses, the paper receives a positive overall score and is recommended for acceptance.




Author Feedback

We would like to express our sincere gratitude to all the reviewers for their meticulous assessments. We are pleased to see that the reviewers agree on the novelty and efficacy of the proposed model for medical image segmentation. Some major questions are addressed below.

Q1. The comparisons between this work and prior art that does not use diffusion methods are not fair. As stated in Sec. 3.2, “during inference an average ensemble of 25 predictions is considered as the final prediction”. I believe that if the previous non-diffusion-based methods were also ensembled 25 times, their metrics would become significantly better as well. (R1)

Response: We would like to clarify that non-diffusion approaches are normally non-stochastic in nature. Therefore, performing an ensemble of predictions would yield the same output as a single prediction. In contrast, our diffusion model is stochastic as we sample noise from the normal distribution for each prediction. Therefore, we consider an ensemble of 25 predictions, following the protocol from prior diffusion-based segmentation models (such as MedSegDiff and MedSegDiff-V2). Hence, we believe that the comparison between our model and non-diffusion models is fair.
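The argument in this response can be illustrated with a small sketch of average ensembling. It is not the authors' inference code: the two toy predictors are hypothetical stand-ins (one deterministic, one that perturbs its output with freshly sampled Gaussian noise), and the 0.5 threshold is an assumption; only the ensemble size of 25 comes from the text.

```python
import random

def average_ensemble(predict, image, num_samples=25, threshold=0.5):
    """Average num_samples predictions, then binarize the mean map."""
    total = [0.0] * len(image)
    for _ in range(num_samples):
        probs = predict(image)
        total = [s + p for s, p in zip(total, probs)]
    mean = [s / num_samples for s in total]
    mask = [1 if m >= threshold else 0 for m in mean]
    return mean, mask

def deterministic_predict(image):
    """Toy non-stochastic model: the same input always gives the same output."""
    return list(image)

def make_stochastic_predict(seed=0, noise_std=0.2):
    """Toy stochastic model: each call draws new noise, like a diffusion sampler."""
    rng = random.Random(seed)
    def predict(image):
        return [min(1.0, max(0.0, p + rng.gauss(0.0, noise_std))) for p in image]
    return predict

# Toy "image": per-pixel foreground probabilities (an assumption for brevity).
image = [0.75, 0.25]
det_mean, det_mask = average_ensemble(deterministic_predict, image)
sto_mean, sto_mask = average_ensemble(make_stochastic_predict(), image)
```

For the deterministic predictor the ensemble mean equals a single prediction, whereas for the stochastic one averaging reduces the sampling variance of each pixel.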

Q2. It would be better if the authors compared the multi-sized design with the inception module in CNNs, both of which share similar ideas and structure; e.g. the Inception-v4 paper, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning” (AAAI’17). (R1)

Response: We acknowledge the reviewer’s suggestion, and we agree that the main idea behind our multi-sized design and the inception module is multi-scale feature extraction. In the inception module, this is achieved by employing convolutional filters of various sizes. In our work, we employ multi-sized windows for computing linear attention to achieve multi-scale feature extraction.
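The idea of computing linear attention inside windows of several sizes can be sketched as follows. This is not the authors' layer: it uses a 1-D token sequence instead of 2-D image windows, sets queries, keys, and values equal to the tokens (no learned projections), takes the `elu(x) + 1` feature map as one common linear-attention formulation, and fuses the scales by plain averaging. All function names are hypothetical.

```python
import math

def feature_map(x):
    """elu(x) + 1, which keeps attention features strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(queries, keys, values):
    """Linear attention in one window: phi(q)·(Σ phi(k) vᵀ) / (phi(q)·Σ phi(k))."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    kv = [[0.0] * dim_v for _ in range(dim_k)]   # running sum of phi(k) vᵀ
    k_sum = [0.0] * dim_k                        # running sum of phi(k)
    for k, v in zip(keys, values):
        phi_k = [feature_map(x) for x in k]
        for i in range(dim_k):
            k_sum[i] += phi_k[i]
            for j in range(dim_v):
                kv[i][j] += phi_k[i] * v[j]
    out = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        norm = sum(a * b for a, b in zip(phi_q, k_sum))
        out.append([sum(phi_q[i] * kv[i][j] for i in range(dim_k)) / norm
                    for j in range(dim_v)])
    return out

def multi_window_attention(tokens, window_sizes=(2, 4)):
    """Run windowed linear attention at each size and average the results."""
    dim = len(tokens[0])
    fused = [[0.0] * dim for _ in tokens]
    for size in window_sizes:
        for start in range(0, len(tokens), size):
            window = tokens[start:start + size]
            attended = linear_attention(window, window, window)
            for offset, row in enumerate(attended):
                for j in range(dim):
                    fused[start + offset][j] += row[j] / len(window_sizes)
    return fused

tokens = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
fused = multi_window_attention(tokens)
```

Small windows mix only nearby tokens while larger ones mix a wider neighborhood, which parallels the role of different filter sizes in an inception module; the cost per window is linear in the number of tokens because the key-value summary is computed once.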

Q3. Typographical errors in Table 3 (R1, R2, R3)

Response: We thank the reviewers for bringing these typographical errors to our attention. They will be corrected in the final version of the article.

Once again, we sincerely thank the reviewers for their valuable feedback, and we will take all of their comments into consideration for the final version of our paper.


