Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Reza Azad, Amirhossein Kazerouni, Babak Azad, Ehsan Khodapanah Aghdam, Yury Velichko, Ulas Bagci, Dorit Merhof

Abstract

Vision Transformer (ViT) models have demonstrated breakthroughs in a wide range of computer vision tasks. However, compared to Convolutional Neural Network (CNN) models, ViT models have been observed to struggle to capture the high-frequency components of images, which can limit their ability to detect local textures and edge information. As abnormalities in human tissue, such as tumors and lesions, may vary greatly in structure, texture, and shape, high-frequency information such as texture is crucial for effective semantic segmentation. To address this limitation of ViT models, we propose a new technique, Laplacian-Former, which enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. More specifically, our proposed method utilizes a dual attention mechanism combining efficient attention and frequency attention: the efficient attention mechanism reduces the complexity of self-attention to linear while producing the same output, and the frequency attention mechanism selectively intensifies the contribution of shape and texture features. Furthermore, we introduce a novel efficient enhancement multi-scale bridge that effectively transfers spatial information from the encoder to the decoder while preserving the fundamental features. We demonstrate the efficacy of Laplacian-Former on multi-organ and skin lesion segmentation tasks, with +1.87% and +0.76% Dice score improvements over SOTA approaches, respectively.
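The linear-complexity efficient attention the abstract refers to can be illustrated with a minimal sketch. This follows the standard efficient-attention formulation (softmax applied separately to queries and keys, so the small d×d context matrix K^T V is computed before touching the n×n token interactions); the shapes and NumPy implementation below are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(Q, K, V):
    """Linear-complexity attention: softmax is applied to Q along the feature
    axis and to K along the token axis, so the small (d, d) context matrix
    K^T V is computed first -- O(n * d^2) instead of O(n^2 * d)."""
    q = softmax(Q, axis=-1)  # (n, d): normalize each query over features
    k = softmax(K, axis=0)   # (n, d): normalize each key over all tokens
    context = k.T @ V        # (d, d) global context, independent of n
    return q @ context       # (n, d) attended output

n, d = 1024, 64              # n tokens, d-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = efficient_attention(Q, K, V)
print(out.shape)             # (1024, 64)
```

Because the n×n attention matrix is never formed, memory and compute grow linearly in the number of tokens, which is what makes the mechanism attractive for high-resolution medical images.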

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_70

SharedIt: https://rdcu.be/dnwB4

Link to the code repository

https://github.com/mindflow-institue/Laplacian-Former

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    The paper entitled “Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection” proposes a new Efficient Frequency Attention that leverages the merits of two mechanisms: efficient attention and frequency attention. In addition, the authors introduced a bridge between the encoder and decoder that preserves fundamental features, and they claimed improved performance on segmentation tasks when compared with SOTA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The use of an attention mechanism that leverages frequency information to enhance texture feature extraction is a very good contribution, especially for tasks related to biomedical image segmentation. The paper is also well structured, and the diagrams are very well explained. The number of experiments is quite fair.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The attention mechanism is well supported with mathematical representation; however, there is some lack of mathematical support for the rest of the components in the proposed architecture. Some outputs from the skin lesion dataset would also have been appreciated.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper gives good guidelines for reproducibility of the work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see my previous comments. The paper is well structured; I think the most relevant and novel contribution is the frequency attention, which could probably be highlighted a little more. Other than that, more comparisons with SOTA would be welcome, especially for the ISIC dataset.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper includes a novel approach for texture feature representation extraction by using frequency mechanism.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors proposed Laplacian-Former, which reduces the complexity of self-attention to linear while selectively emphasizing information at different frequency scales using Laplacian pyramids. The authors also proposed a novel efficient enhancement multi-scale bridge that connects the encoder and decoder. The proposed method outperforms other state-of-the-art methods on different public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The main motivation of the paper is clear. In medical image analysis, it is important to analyze fine details, but the attention mechanism tends to act like a low-pass filter that erases high-frequency information.

    2. The authors utilized a Laplacian pyramid in the attention mechanism to preserve high-frequency information during the attention process.

    3. The method achieves state-of-the-art results on different public datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The motivation of the efficient enhancement multi-scale bridge is not clear. Although it is obvious that some operations on the skip connection can help improve the results in general, it is not clear why the authors chose such design.

    2. The explanation of the frequency enhancement Transformer block and the efficient enhancement multi-scale bridge is not clear. There should be more mathematical equations explaining them, as only figures and texts are not enough.

    3. How the authors used the 8 pages is questionable. The parts on the frequency enhancement Transformer block and the efficient enhancement multi-scale bridge need more detail, while the introduction is far too long. Besides, equations 1 and 2 are not necessary, as they are more or less common knowledge in the MICCAI community.

    4. The argument for using the frequency enhancement Transformer block to preserve high-frequency information would be stronger if the authors could show some attention maps learned with it.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The datasets used in the paper are public and the code will be made available although there might not be enough details in the paper to re-implement the method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. It would be nice if the authors could reorganize the paper so that the introduction is less lengthy, there are fewer unnecessary equations (namely equations 1 and 2), and more details and equations are included for the frequency enhancement Transformer block and the efficient enhancement multi-scale bridge.

    2. Some simple experiments showing how the frequency enhancement Transformer block can help preserve high-frequency information would be desirable.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major issue with this paper is that the authors focused too much on the background introduction and background knowledge while not providing enough details on the proposed method. However, the motivation of the paper is good, and they proposed an interesting and innovative idea to solve the problem.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose “Laplacian-Former”, a new Vision Transformer model with a dual attention mechanism to detect local textures and improve segmentation. The proposed efficient attention mechanism, EF-ATT, consists of an efficient attention module that reduces the complexity of self-attention to linear, and a frequency attention module that emphasizes information at each frequency level via a Laplacian pyramid followed by a fusion strategy. They also add a bridge between the encoder and decoder to transfer spatial information while keeping essential features. They compare their Laplacian-Former with SOTA and validate their approach on ISIC 2018 (skin lesions) and the Synapse dataset (CT scans).
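    The Laplacian-pyramid frequency decomposition this review summarizes can be sketched as follows. This is a minimal single-resolution variant, assuming each level is simply the difference between the image and its Gaussian-blurred version so that the levels telescope back to the original; the pure-NumPy blur, level count, and sigma are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian low-pass filter with edge padding (pure NumPy)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kern = np.exp(-x**2 / (2 * sigma**2))
    kern /= kern.sum()
    smooth = lambda v: np.convolve(np.pad(v, radius, mode="edge"), kern, mode="valid")
    out = np.apply_along_axis(smooth, 1, img)   # blur rows
    return np.apply_along_axis(smooth, 0, out)  # blur columns

def laplacian_pyramid(img, levels=3, sigma=1.0):
    """Each level holds one band of high-frequency detail (image minus its
    low-pass version); the final entry is the remaining low-pass residual."""
    pyramid, current = [], img.astype(float)
    for _ in range(levels):
        low = gaussian_blur(current, sigma)
        pyramid.append(current - low)  # band-pass (texture/edge) component
        current = low
    pyramid.append(current)            # low-frequency residual
    return pyramid

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
pyr = laplacian_pyramid(img)
print(len(pyr))                    # 4: three detail bands + residual
assert np.allclose(sum(pyr), img)  # levels telescope back to the original
```

    Re-weighting the band-pass levels before summing them back is what lets a frequency-attention module amplify texture and edge information that plain self-attention tends to smooth away.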

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Originality: The authors propose an approach based on efficient attention mechanisms and an original bridge that preserves fundamental features. Validation: The authors validate their results by comparing their algorithm with 8 other existing architectures. Extendability: This work is extendable to many biomedical CT datasets (apart from the kidney), and to skin lesions.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limitations in clarity. The equations inside the figures are too small. The samples in the last figure should be discussed alongside the results in the tables, or other samples should be chosen. Some errors should be corrected.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In the Methods part, the authors mention four patch-expanding blocks in the decoder, while in Figure 1 only three patch-expanding blocks are visible. The authors should correct this error or account for the fourth step by updating Figure 1.

    The authors should add a legend to explain the acronyms or define all the acronyms in the text. For instance, LN means LayerNorm, etc.

    In equation (1), d is not explained in the text. The authors should state what d represents; in the literature, the square root of d_k can also be found.

    In the text referencing equation (3), if l+1 is chosen, the authors should keep the same notation throughout the manuscript (replacing L+1). Is there a difference between L and l?

    In Figures 1, 2, and 3, some legends are unreadable unless the figure is viewed at double size. The authors should increase the size of some elements in the figures (the equations in Figures 1 and 2, and the anatomical legends in Figure 3).

    In Figure 4, the bottoms of the x-axis labels are cut off.

    In the sample shown (second row), the kidneys are not well segmented, while the scores indicate the authors' method as best or second best. Could other samples be shown in an appendix, or could the scores be discussed with respect to the sample shown in this figure for this anatomical structure?

    For future work, the authors should report how many runs the scores were calculated over for each sample, and the computation time.

    For the appendices and future work, the boundaries are not visible. To clarify, the authors may superimpose the boundary predictions on the ground truth and highlight the regions (not just the boundaries) that are not well predicted (false positive and false negative zones).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The weaknesses concern clarity and the size of some elements inside the figures. Some errors need to be corrected but do not alter the contribution.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors proposed “Laplacian-Former” to detect local textures and improve medical image segmentation. The reviewers all agree to (weakly) accept the paper.

    One concern with this paper is clarity, for example, the motivation of the efficient enhancement multi-scale bridge. Also, the authors should support the “efficiency” claim with experiments rather than only the O(n^2)-to-O(d^2·n) argument; note that in the higher layers of a vision transformer, d may be even larger than n, so the real efficiency may be worse. The title and abstract mention “local texture”; the experimental section should be able to correspond to that claim.
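    The meta-reviewer's complexity point can be checked with a back-of-envelope multiply count: standard self-attention scales as O(n²·d) while efficient attention scales as O(n·d²), so which one wins depends on whether the token count n or the embedding dimension d dominates. The token counts and dimensions below are illustrative values typical of hierarchical ViT stages, not measurements from the paper.

```python
def attention_flops(n, d):
    """Rough multiply count for one attention layer with n tokens, dim d."""
    standard = n * n * d   # QK^T and (softmax)V products: ~O(n^2 * d)
    efficient = n * d * d  # K^T V and Q(K^T V) products: ~O(n * d^2)
    return standard, efficient

# Early layer (many tokens, small dim): efficient attention is far cheaper.
print(attention_flops(n=3136, d=96))
# Deep layer (few tokens, large dim): the O(n * d^2) term dominates instead.
print(attention_flops(n=49, d=768))
```

    This illustrates the meta-reviewer's caveat: the linear-in-n formulation only pays off in the high-resolution stages where n >> d.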




Author Feedback

We would like to thank all of the reviewers and the meta-reviewer for their constructive comments. We are committed to thoroughly examining and thoughtfully incorporating all of their valuable suggestions into the final version of our paper, to the best of our abilities.


