
Authors

Omkar Thawakar, Rao Muhammad Anwer, Jorma Laaksonen, Orly Reiner, Mubarak Shah, Fahad Shahbaz Khan

Abstract

Accurate 3D mitochondria instance segmentation in electron microscopy (EM) is a challenging problem and serves as a prerequisite to empirically analyze their distributions and morphology. Most existing approaches employ 3D convolutions to obtain representative features. However, these convolution-based approaches struggle to effectively capture long-range dependencies in volumetric mitochondria data, due to their limited local receptive field. To address this, we propose a hybrid encoder-decoder framework based on a split spatio-temporal attention module that efficiently computes spatial and temporal self-attentions in parallel, which are later fused through a deformable convolution. Further, we introduce a semantic foreground-background adversarial loss during training that aids in delineating the region of mitochondria instances from the background clutter. Our extensive experiments on three benchmarks, Lucchi, MitoEM-R and MitoEM-H, reveal the benefits of the proposed contributions, achieving state-of-the-art results on all three datasets. Our code and pretrained models will be publicly released.
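The split attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the shared Q/K/V projections for both branches, and the additive fusion are assumptions made for illustration (the paper fuses the two branches with a deformable convolution, which is omitted here). The sketch only shows how one branch attends within each slice while the other attends across slices at each pixel location.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last (token) axis.
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def split_attention(x, w_q, w_k, w_v):
    """x: feature volume of shape (T, H, W, C); w_*: (C, d) projections.

    Hypothetical helper: the spatial and "temporal" (slice-axis) branches
    are computed independently and fused by a plain sum, standing in for
    the deformable-convolution fusion described in the paper.
    """
    T, H, W, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (T, H, W, d)
    d = q.shape[-1]

    # Spatial branch: H*W tokens attend to each other within every slice.
    qs, ks, vs = (a.reshape(T, H * W, d) for a in (q, k, v))
    spatial = attention(qs, ks, vs).reshape(T, H, W, d)

    # Temporal branch: permute so the slice axis becomes the token axis,
    # i.e. T tokens attend to each other at every pixel location.
    qt, kt, vt = (a.reshape(T, H * W, d).transpose(1, 0, 2) for a in (q, k, v))
    temporal = attention(qt, kt, vt).transpose(1, 0, 2).reshape(T, H, W, d)

    return spatial + temporal                      # assumed placeholder fusion

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6, 8))                  # toy volume, not real EM data
w = rng.normal(size=(8, 8))
out = split_attention(x, w, w, w)
print(out.shape)
```

The permutation in the temporal branch is what lets the same attention routine be reused: attention always runs over the second-to-last axis, so moving the slice axis into that position changes which tokens interact.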

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43993-3_59

SharedIt: https://rdcu.be/dnwN4

Link to the code repository

https://github.com/OmkarThawakar/STT-UNET

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses the task of instance segmentation in 3D electron microscopy volumes and proposes a new hybrid CNN-Transformer based on an axially split (“spatio-temporal”) self-attention mechanism, a specific deformable fusing operation, and a PatchGAN loss function. They evaluate their method on three EM datasets, where they achieve better performance than the current SOTA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • solid idea, good experiments
    • beats SOTA by a large margin
    • well reported results
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • contribution is unclear and related work discussion is inadequate: “Spatio-temporal” attention is a very well established mechanism (“Axial attention”/“Criss-cross attention”)
    • manuscript lacks clarity in certain parts (see below)
    • Unclear evaluation splits (results appear to be too good to be true)
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    seems ok

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The experiments in the paper appear to have been carried out solidly and the results are clearly impressive. However, a big concern is the lack of discussion of related work for one of the core claimed contributions: “spatio-temporal” attention (i.e. splitting attention across dimensions) is a well-known mechanism (“axial attention”) and there are many papers on it, e.g. (all cited 300+ times):

    Huang, Zilong, et al. “Ccnet: Criss-cross attention for semantic segmentation.” CVPR 2019
    Ho et al. “Axial Attention in Multidimensional Transformers.” 2019
    Wang et al. “Axial-deeplab: Stand-alone axial-attention for panoptic segmentation.” ECCV 2020
    Bertasius et al. “Is space-time attention all you need for video understanding?” ICML 2021
    Yan et al. “After-unet: Axial fusion transformer unet for medical image segmentation.” WACV 2022

    None of those are cited which is clearly inadequate.

    • The same goes for the FG/BG adversarial loss, which again is standard and simply the PatchGAN loss from [8].

    • Finally, it is very surprising that even without any postprocessing (apart from connected components) the method achieves e.g. 0.958 vs 0.917 next best on MitoEM-R (a huge improvement). Is that on the validation data? (Note that [28] is on the test data.) And is that MitoEM v1 or MitoEM-v2? And were the results simply uploaded to the MitoEM challenge? Please clarify.

    Minor:

    • “The denoising is performed by convolving the current frame with two adjacent frames using predicted kernels, thereby generating the resultant frame by adding the convolution outputs” -> I don’t understand that sentence. What is the “denoising” module?
    • Why did you call it “spatio-temporal”? There is no time in EM. It’s very confusing.
    • Was the final result obtained via tiled prediction? Which tile size?
    • What positional embeddings were used?
    • What is the runtime?
    • Table 5 is unclear (what are the differences?)
    • Explain Eq. (3) (e.g. what is the integral?)
    • typo “BioRxiv”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good demonstration, good results. Needs to address the concerns outlined above however.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This manuscript presents a method for segmenting mitochondria instances from EM data. The authors build their method on top of the existing one, with the two main additions in this submission being spatio-temporal attention mechanism and adversarial loss. The main downside of this work is validation as, for the larger of the two data sets, the authors compare their results obtained on the validation set to the ones obtained on the independent test set.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Two novel components that improve performance

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Reporting results on the Mito-EM validation set, whereas the benchmark results were obtained on the test set.
    • Description of the methodology needs to be improved.
    • The manuscript needs a language check.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper is reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. For the Mito-EM data set, the authors validate their method on the challenge data and use the corresponding challenge results as a benchmark. However, this comparison is unfair, as the results of both the proposed method and the earlier method on which it is based are obtained on the validation set, whereas all the methods that participated in the challenge were evaluated on the independent test set, as the corresponding publication clearly states. Even though the organizers do not provide the test set images, they offer the possibility of evaluating submitted results and comparing them to the current leaderboard.
    2. The authors introduce two modifications to the baseline method. While neither of them is novel as such, they have not been used for this particular task. Moreover, they suggest using a spatio-temporal attention mechanism, which they end up decoupling into two separate attention mechanisms. The idea of treating this particular 3D image stack as 2D+time data, even though the authors did not mention it explicitly, is interesting as the images are correlated. From that perspective it would be interesting, for completeness, to extend Table 5 with an experiment in which only the spatial attention is used.
    3. The manuscript, especially the Supplementary Material, needs a language check.
    4. Section 2.1 needs considerable editing:
       a. Variables Q, K, V, R need to be described.
       b. Phrase “T is volume size” is unclear and needs to be rephrased.
       c. Phrase “d_k is dimension of Q_s” is unclear and needs to be rephrased.
       d. It is necessary to mention why the Q_t, K_t and V_t maps need to be permuted.
    5. Page 8: “We also evaluate with different input volumes: 4,8,16,32. We observe best results are obtained when using 32 input volume.” I was not able to understand what is meant here. This part needs to be rewritten.
    6. The supplementary material in its current form is somewhat separated from the main document, as the latter does not contain any references to the former. This especially holds for Table S3, which I was not able to interpret.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting work that needs a better and more objective validation.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    In my opinion, objective validation of the proposed method remains a major concern with regard to this submission. In their rebuttal the authors have clarified that (1) their primary goal was to compare their method to the baseline, and (2) upon request, they validated their method on the test set. However, comparing the method to the baseline is clearly not sufficient for getting an objective idea about its performance; it also needs to be compared to the state of the art, on the same data. I would have also liked to see more detail on the validation on the test set, since these data are not public and this step requires the assistance of the challenge organizers in some form.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel hybrid CNN-transformers based encoder-decoder framework for accurate 3D mitochondria instance segmentation in electron microscopy. The key contribution is the introduction of a split spatio-temporal attention module that is claimed to efficiently capture long-range dependencies by computing spatial and temporal self-attentions in parallel. Additionally, the paper introduces a semantic foreground-background adversarial loss during training, which is claimed to improve the delineation of mitochondria instances from background clutter. The proposed approach achieves state-of-the-art segmentation performance on three benchmark datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper are:

    1. The introduction of a novel spatio-temporal split attention module, which reduces the burden of large memory requirements.

    2. The solid results achieved by the proposed method, as it outperforms the baseline models by a significant margin, demonstrating its effectiveness in 3D mitochondria instance segmentation.
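    The memory argument in strength 1 can be made concrete with a rough count of attention-score entries. The sketch below is an illustration only: the function name and the 32×20×20 feature-map size are hypothetical and are not taken from the paper. It compares joint attention over all T·H·W tokens with a split scheme that attends within each slice and across slices separately.

```python
def attention_pairs(t, h, w):
    """Count attention-score entries for a (t, h, w) token grid.

    Hypothetical helper for illustration; ignores heads, batch size
    and channel dimension.
    """
    n = t * h * w
    full = n * n                       # every token attends to every token
    spatial = t * (h * w) ** 2         # per slice: (h*w) x (h*w) scores
    temporal = (h * w) * t ** 2        # per pixel location: t x t scores
    return full, spatial + temporal

# Assumed feature-map size, chosen only to show the order of magnitude.
full, split = attention_pairs(32, 20, 20)
print(full, split, round(full / split, 1))
```

    Under these assumed sizes the split scheme stores well over an order of magnitude fewer score entries than joint 3D attention, which is the kind of saving the strength above refers to.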

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The major claims of the paper are primarily presented through qualitative descriptions, lacking concrete evidence to support their assertions. This raises concerns about the validity and impact of the proposed methodology.

    1. The authors propose the split spatio-temporal (SST) attention module, claiming that it captures long-range dependencies more effectively than traditional convolutions. However, no concrete evidence or qualitative analysis is provided to support this assertion, making it difficult for the reader to evaluate the true impact of this innovation.

    2. Similarly, the authors propose a semantic foreground-background (FG-BG) adversarial loss, asserting that it helps to accurately delineate mitochondria instances from the cluttered background. However, no empirical evidence or comparisons are provided to support this qualitative claim.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have used publicly available benchmark datasets for their experiments, which is a positive aspect for the reproducibility of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In Fig. 1, the images are presented at a small scale, making it difficult for the reader to discern details and fully appreciate the segmentation results. Furthermore, the figure displays an excessive number of mitochondria instances, which can lead to confusion and uncertainty regarding which specific elements the reader should focus on. To make the figure more informative, consider increasing the size of the images for better visibility and reducing the number of mitochondria instances shown.

    In Fig. 2(c), the input volume and the masks are concatenated, but it was incorrectly represented with “addition” symbols.

    Eq. (4) is quite unclear and disorganized. Please refine it.

    Fig. 3 is not informative enough, primarily due to the lack of ground truth and baseline results for context and comparison. Additionally, the small image size and the excessive number of mitochondria displayed make it difficult for the reader to discern relevant information.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Solid results: The paper presents strong results that demonstrate the effectiveness of the proposed approach, as it outperforms existing baseline methods by a significant margin. This indicates the potential impact of the method in the field of 3D mitochondria instance segmentation.

    2. Unsupported qualitative claims: Despite the solid results, some of the major claims are only qualitatively described and lack evidence to substantiate them. This includes the effectiveness of the split spatio-temporal (SST) attention module in capturing long-range dependencies and the benefits of the semantic foreground-background (FG-BG) adversarial loss in delineating mitochondria instances from cluttered backgrounds. Providing additional evidence would strengthen the paper and make the claims more convincing.

    Addressing the concerns related to unsupported qualitative claims would enhance the paper’s overall quality and impact.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a novel hybrid CNN-transformer architecture designed for the segmentation of mitochondria structures in electron microscopy images. The proposed approach incorporates a spatio-temporal attention mechanism and leverages an adversarial loss. The experimental results demonstrate significant improvements over the baseline models. However, it is crucial to acknowledge the existing criticism regarding the novelty of the method. Additionally, there are concerns raised about the validation process, emphasizing the need for testing on the leaderboard. Furthermore, an (empirical) justification of the effectiveness of the proposed idea should be provided. These critical points must be thoroughly addressed in the rebuttal.




Author Feedback

We thank reviewers (R1, R2, R3) for positive feedback: solid idea, good results (R1), two novel components with improved results (R2), novel module with solid results (R3). Our code and models will be made publicly available.

R1:

On Contributions: While spatio-temporal self-attention has been recently explored in other tasks, to the best of our knowledge we are the first to propose a transformer-based framework with split spatio-temporal attention for the problem of 3D mitochondria instance segmentation. Prior works in 3D mitochondria segmentation typically rely on 3D CNNs to encode instance information across the mitochondria (EM) volume and therefore struggle to model global contextual dependencies that extend beyond the designated receptive field. Our approach sets a new SOTA on two benchmarks (acknowledged by R1).

Comparison with other attention methods: We thank R1 and will include all suggested references in our related work discussion. Compared to divided space-time attention (Bertasius et al., ICML’21) and axial attention (Ho et al., arXiv’19), our approach achieves favorable results with gains of 0.9% and 1.1%, respectively, likely due to computing spatial and temporal attention in parallel and later fusing them through a deformable convolution.

On PatchGAN loss [8]: While [8] uses the loss for a generative task, the input to the semantic adversarial loss in our approach is the predicted mask and the ground-truth mask along with the input image, where the loss is used to accurately delineate the region of mitochondria instances from the cluttered background.

On Results: Similar to recent SOTA work [12], we also report results on the standard validation splits provided by MitoEM and Lucchi to have a fair comparison to [12]. Since the MitoEM v1 leaderboard has been replaced by MitoEM-v2, we also compare our approach with [12] on the MitoEM-v2 test set, achieving a gain of 4.1% on MitoEM-R. We thank R1 and will add these results to the revised draft.

Minor issues: (a) To alleviate the effect of noise in MitoEM samples due to EM, we use information from neighboring frames to reconstruct the noisy region. (b) We chose the name because of the volumetric nature of the data and will clarify this in the revised draft. (c) We follow [12] and use the same tile size [32,320,320]. (d) A standard 3D positional embedding is used. (e) Training time is 30 hrs using 2 MI250X GPUs. Inference time is 30 mins. (f) Tab. 5 presents ablations for the design choice of spatial and temporal attention. (g) The integral in Eq. 3 refers to deformable convolution across channels C.

R2:

On test results: Similar to recent SOTA work [12], we also report results on the standard validation splits provided by MitoEM and Lucchi to have a fair comparison to [12]. Since the MitoEM v1 leaderboard has been replaced by MitoEM-v2, we also compare our method with [12] on the MitoEM-v2 test set, achieving a gain of 4.1% on MitoEM-R. We will add these results to the revised draft.

Results with only spatial attention: As recommended, we performed this experiment and observe that our method achieves a gain of 4.8% over spatial attention only. We will add it to the revised draft.

Language check and paper polishing: We thank R2 and will address these issues in the revised draft.

On input volume in page 8: It refers to the number of consecutive frames.

R3:

Impact of our approach: We perform extensive quantitative experiments (Tab. 1–Tab. 5 in the paper), outperforming recent work [12]. We also present a quantitative comparison on noisy data in Tab. 3 of the supplementary material.

Comparison with other attention: We perform such a comparison in Tab. 5 in the paper. In addition, please also see our response to R1, where we discuss that our method also outperforms divided space-time attention (ICML’21) and axial attention (arXiv’19). Our method also achieves a gain of 4.8% over spatial attention alone.

Impact of FG-BG loss: We show the impact of the FG-BG loss in Tab. 3 by integrating our contributions one at a time. Introducing the FG-BG loss leads to an absolute gain of 1%.

Improving Fig 1 and 3: We thank R3 and will improve them in the revised draft.




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    In the rebuttal, the authors provided specific responses addressing the concerns raised by the reviewers regarding the contribution of spatio-temporal self-attention and insufficient validation. While R2 still has some remaining concerns regarding the validation, the paper exhibits merits, particularly in its ability to solve a specific problem: mitochondria segmentation in EM data, using a relatively novel method. Hence, I recommend accepting this paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has provided some explanations regarding the importance of the work, ablation studies on some modules, and clarification on the motivations. The proposed method has novelty, and the performance is promising. The final version should include more comparison studies with the state-of-the-art methods on the same dataset and more detailed explanations of the dataset.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Two reviewers are positive and one reviewer is negative. After reading the rebuttal, R2 did not change the rating score. In my opinion, the rebuttal has addressed the main concerns of the reviewers well. Hence, I think this work can be accepted.


