
Authors

Ken C. L. Wong, Hongzhi Wang, Tanveer Syeda-Mahmood

Abstract

With the introduction of Transformers, different attention-based models have been proposed for image segmentation with promising results. Although self-attention allows the capture of long-range dependencies, it suffers from quadratic complexity in the image size, especially in 3D. To avoid out-of-memory errors during training, input size reduction is usually required for 3D segmentation, but accuracy can be suboptimal when the trained models are applied to the original image size. To address this limitation, inspired by the Fourier neural operator (FNO), we introduce the HartleyMHA model, which is robust to training image resolution with efficient self-attention. FNO is a deep learning framework for learning mappings between functions in partial differential equations, which has the appealing properties of zero-shot super-resolution and a global receptive field. We modify the FNO by using the Hartley transform with shared parameters to reduce the model size by orders of magnitude, which allows us to further apply self-attention in the frequency domain for more expressive high-order feature combination with improved efficiency. When tested on the BraTS’19 dataset, it achieved superior robustness to training image resolution compared with other tested models, with less than 1% of their model parameters.
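
The code repository is listed as N/A, so purely as an illustration, below is a minimal NumPy sketch of the kind of Hartley-based spectral layer the abstract describes: the discrete Hartley transform is obtained from the FFT as Re(F) - Im(F), only the lowest k_max frequencies per axis are retained, and real-valued weights mix the retained modes before transforming back. The function names, the single-channel pointwise weighting, and the weight layout are assumptions made for brevity; the paper's actual layers operate on multi-channel features with shared parameters and add self-attention over the frequency modes.

    import numpy as np

    def dht(x):
        """Discrete Hartley transform via the FFT: H = Re(F) - Im(F) for real input."""
        f = np.fft.fftn(x)
        return f.real - f.imag

    def idht(h):
        """Inverse DHT: the DHT is an involution up to a 1/N scale factor."""
        return dht(h) / h.size

    def hartley_spectral_layer(x, weights, k_max):
        """Illustrative FNO-style spectral mixing with the Hartley transform.

        x       : real 3D volume of shape (D, H, W)
        weights : real weights for the retained low-frequency modes,
                  shape (2*k_max, 2*k_max, 2*k_max) -- a hypothetical layout
        k_max   : number of retained frequencies per axis (low-frequency truncation)
        """
        h = dht(x)
        out = np.zeros_like(h)
        # Keep only the lowest k_max positive and negative frequencies per axis,
        # mirroring the mode truncation used by FNO-type models.
        idx = np.r_[0:k_max, -k_max:0]
        sub = h[np.ix_(idx, idx, idx)]
        out[np.ix_(idx, idx, idx)] = weights * sub   # pointwise real-valued mixing
        return idht(out)

    # Toy usage: a random volume and random spectral weights.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32, 32))
    w = rng.standard_normal((16, 16, 16)) * 0.1
    y = hartley_spectral_layer(x, w, k_max=8)
    print(y.shape)  # (32, 32, 32)

Because the transform and the weights are both real-valued, such a layer avoids the complex arithmetic of an FFT-based FNO layer; the orders-of-magnitude reduction reported in the abstract, however, is attributed mainly to parameter sharing across modes.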

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_35

SharedIt: https://rdcu.be/dnwDJ

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

The authors introduce the HartleyMHA model; by modifying the Fourier neural operator with the Hartley transform, they achieve improved efficiency and robustness when training segmentation models with lower-resolution inputs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors introduce the HartleyMHA model; by modifying the Fourier neural operator with the Hartley transform, they achieve improved efficiency and robustness when training segmentation models with lower-resolution inputs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) As shown in Table 1 and the text, HNOSeg and HartleyMHA have far fewer parameters (less than 1%) than V-Net-DS and UTNet. This should translate into much faster training for the smaller models, yet this submission does not include any training time comparison. The inference time comparison in Table 2 fails to show any advantage of using the smaller models.

2) It is surprising that HNOSeg and HartleyMHA, despite having far fewer parameters, have inference times similar to those of V-Net-DS and UTNet (Table 2).

3) The comparison of V-Net-DS/UTNet vs. FNO/HNOSeg/HartleyMHA on low-resolution input (Table 1, right column) might not be fair. The superior performance of the latter three may mostly come from their inherent super-resolution capability. If V-Net-DS/UTNet incorporated a super-resolution post-processing step, their Dice and HD95 scores might improve significantly.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors’ report on reproducibility appears to be reasonable.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

1) Training performance of the proposed smaller models (HNOSeg/HartleyMHA) versus the larger models (V-Net-DS/UTNet) should be provided.

2) The inference time comparison in Table 2 fails to show any advantage of using the smaller models, which is surprising.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The design of the HartleyMHA model is, to my knowledge, novel and leads to much smaller segmentation networks. However, the advantage of the smaller models, in terms of training and inference time, is not demonstrated in this submission.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

In this paper, the authors focus on reducing the number of learnable parameters of transformer-based segmentation networks without substantially decreasing their segmentation accuracy. This is a challenging task of crucial importance, especially for the medical community, which calls for efficient and reliable 3D segmentation approaches. The proposed Hartley-transform-based approach with frequency-domain self-attention reduces the number of learnable parameters by several orders of magnitude while achieving segmentation performance on the BraTS19 dataset comparable to the evaluated baseline alternatives.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is a technically sound, well-written paper that introduces a novel use of the Hartley transform for segmenting 3D medical image data. The reported results are convincing and clearly demonstrate the purpose of the proposed approach.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed approach has been evaluated using a single dataset only. Several hyperparameters have been set in an ad-hoc fashion without any quantitative evidence. The limitations and the situations when the proposed approach fails have not been discussed.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    -

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • There are several hyperparameters, such as k_max and the number of epochs, fixed at ad-hoc values without demonstrating how robust the proposed approach is to these choices.
    • Despite the convincing segmentation results reported for the BraTS19 dataset, it would be beneficial to see how well the proposed approach can segment targets in medical image data of other modalities.
    • The reason for preferring the Pearson’s correlation coefficient loss to the Dice and weighted cross-entropy losses needs to be quantitatively documented.
    • The text size used in Fig. 1 is too small, which makes the figure content barely readable in a printed version of the paper.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This is a well-written, technically sound conference paper with convincing, yet limited, results, which introduces a novel use of the Hartley transform for reducing the number of learnable parameters without substantially affecting the overall segmentation performance.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

This paper proposes a reduced-size transformer model to alleviate the out-of-memory errors encountered with Transformer-based 3D segmentation models. Inspired by the Fourier neural operator deep learning framework, the authors use a reduced-size model with multi-head attention operating in the frequency domain. Dice scores and 95% Hausdorff distance results highlight the efficacy of the proposed models when trained on downsampled images and applied to high-resolution images at inference, in contrast to SOTA segmentation methods that fail in this setting.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors propose two model architectures, HartleyMHA and HNOSeg. Both architectures share a reduced parameter size and self-attention mechanisms. HartleyMHA benefits from a multi-head attention mechanism that, in theory, allows long-range dependencies to be captured from low-frequency inputs. Experimental results show that the proposed approach is competitive with SOTA approaches, i.e., V-Net-DS, UTNet, and FNO.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Although the proposed method seems to perform well, the experimental results on the original training image size show that V-Net-DS has better performance. Inference time also did not vary much despite the reduced size of HartleyMHA and HNOSeg. The justification for using transformers by leveraging their long-range dependencies should be further supported by examining other 3D segmentation datasets (e.g., the Medical Segmentation Decathlon). The GPU memory savings are relatively small compared with the orders-of-magnitude reduction in model size: only about a factor of two. In general, I think the contribution of this work might not be enough to argue that the proposed method is clinically more practical than the baseline methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I noted the pre-trained weights and training/evaluation code, and the training scheme is clearly explained. However, without the architecture implementation code, replication might not be easy.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

I suggest the authors include the following in the rebuttal or in the final submission if the paper is accepted:

    1. Memory requirements during training for different image sizes
    2. Standard deviations of the evaluation metrics
    3. Clarification of the benefit of the proposed method in a clinical scenario
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The idea of a reduced-size transformer model is interesting, and the experimental results do show robustness to different image resolutions. Although the paper shows that the proposed approach is competitive with SOTA approaches (i.e., V-Net-DS, UTNet, and FNO), its benefit in a clinical scenario is not clear.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper is well written and technically good. HartleyMHA and HNOSeg are introduced, achieving improved efficiency and robustness in comparison with the state of the art. It would be useful to justify the use of a single dataset for comparison, or to include more experimental results if they exist. The hyperparameters should be justified, and the training times and memory requirements should be discussed in more detail. The real-world impact should also be discussed.




Author Feedback

N/A


