Paper Info

Authors

Ahmed Taha, Yen Nhi Truong Vu, Brent Mombourquette, Thomas Paul Matthews, Jason Su, Sadanand Singh

Abstract

Medical images come in high resolutions. A high resolution is vital for finding malignant tissues at an early stage. Yet, this resolution presents a challenge in terms of modeling long range dependencies. Shallow transformers eliminate this problem, but they suffer from quadratic complexity. In this paper, we tackle this complexity by leveraging a linear self-attention approximation. Through this approximation, we propose an efficient vision model called HCT that stands for High resolution Convolutional Transformer. HCT brings transformers’ merits to high resolution images at a significantly lower cost. We evaluate HCT using a high resolution mammography dataset. HCT is significantly superior to its CNN counterpart. Furthermore, we demonstrate HCT’s fitness for medical images by evaluating its effective receptive field. Code available at https://bit.ly/3ykBhhf

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_3

SharedIt: https://rdcu.be/cVRsP

Link to the code repository

https://github.com/whiterabbit-ai/hct

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

This paper proposes a high resolution transformer based model HCT, in particular the attention convolution (AC) block.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The proposed HCT is a convolutional transformer for high resolution inputs. With linear attention approximation, HCT seems to improve over GMIC, a benchmark for high resolution mammography.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

This paper only compares with one existing approach, namely GMIC. Also, the proposed method can be seen as a modification to GMIC.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code is given; the description seems to be clear enough to reproduce the work.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The proposed method seems to be highly related to GMIC. Adding more comparison with other methods can strengthen the results.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes an efficient transformer model to process the high resolution mammograms. The results suggest improvements over previous approaches.
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Not Answered
[Post rebuttal] Please justify your decision

Not Answered

Review #3

Please describe the contribution of the paper

Transformers have limited application in analyzing large medical images because of the heavy computational requirements of the self-attention block. This work adopts the recently proposed Performer, which has a reduced computational cost, to power the shallow layers of the neural network, providing a possible way to explore efficient learning for other research.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This work adopts the latest transformer technique, Performer (Transformer architectures that can estimate regular full-rank-attention Transformers with provable accuracy, but using only linear space and time complexity, without relying on any priors such as sparsity or low-rankness), and demonstrates the feasibility of applying the Performer to large-scale medical inputs.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

This work fails to clearly explain why efficient attention is suitable for high-resolution input. Meanwhile, the title fails to reflect the entire submission and misleads readers about the architecture design.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The reproducibility of this paper is good as the author used a public implementation and evaluated their method on the public dataset.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
This work adopts Performer equipped with different settings to detect the tumor region in high resolution mammograms. However, many basic concepts are wrongly addressed, making the submission unconvincing.
1. The ‘high resolution’ concept is misused throughout the submission. High resolution refers to a fine spatial distance per pixel rather than a large number of pixels.
2. The attention mechanism aims at dynamically searching out similar regions in the images, and this holds for all of its variants. The statement “HCT focuses dynamically on regions based on their contents and not their spatial positions” therefore misleads the readers and should be corrected.
3. Since the authors argue that the AC block is the contribution, please clarify the difference between the AC block and the attention block in Performer [1].
[1] Choromanski, Krzysztof, et al. “Rethinking attention with performers.” arXiv preprint arXiv:2009.14794 (2020).
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Unlike the transformer, which is limited by GPU memory, the Performer can be deployed in the shallow layers thanks to its full-rank computing strategy. Taking advantage of Performer, this work validates the feasibility of Performer for analyzing high-resolution mammography. Despite the improved performance, this work lacks a dedicated comparison between deep and shallow networks, which is emphasized in the title.
Number of papers in your stack

6
What is the ranking of this paper in your review stack?

4
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

Although the authors made a significant effort to clarify their contribution and technical novelty, the novelty is not enough for MICCAI even if they did fit a Transformer to high-resolution medical images.

Review #5

Please describe the contribution of the paper

In this paper, the authors address the complexity of recognition on high-pixel-count medical images by exploiting a linear self-attention approximation. With this approximation, the authors propose an efficient vision model called HCT, which stands for High Resolution Convolutional Transformer. Extensive case studies validate the effectiveness of the proposed HCT.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The problem studied in this paper is important and needs to be solved in Medical Imaging
- Extensive case studies
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Weaknesses:
- Need more justifications about the novelty claims
- Need to include more related work that is highly important
- Need to check for grammatical errors and typos.
- The evaluation needs to be enhanced in terms of baselines, datasets, settings, etc.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The dataset and code are not disclosed in this paper, so the reproducibility of this paper needs to be further improved.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
Comments:
1. Need to include more related work that is highly important:
  [1] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
  [2] Kitaev N, Kaiser L, Levskaya A. Reformer: The Efficient Transformer. International Conference on Learning Representations. 2019.
  - The authors need to introduce some related work on the efficient attention mechanism design.
2. Need more justifications about the novelty claims
  - The proposed Attention-Convolutional (AC) block of this paper is too close to the Convolutional Block Attention Module in Ref. [1]. Therefore, the authors need to provide more justification for the novelty claims.
  - This paper lacks a theoretical analysis and a complexity analysis of the proposed efficient attention mechanism.
  - The following reference may help the authors further improve the quality of this paper. Although Ref. [3] was published only recently, it is still easy for the authors to refer to and follow up on. [3] Zheng, Lin, Chong Wang, and Lingpeng Kong. “Linear Complexity Randomized Self-attention Mechanism.” arXiv preprint arXiv:2204.04667 (2022).
3. The evaluation needs to be enhanced in terms of baselines, datasets, settings, etc.
  - It would be better if the authors could add more benchmark datasets to verify the effectiveness of the proposed model.
  - The authors need to add more advanced baselines such as work based on an efficient attention mechanism (e.g., ref [2]) to further validate the effectiveness of the proposed model.
4. Need to check for grammatical errors and typos
  - The picture in table 1 is blurry, please provide a clearer version.
  - The editorial quality of this paper is largely unsatisfactory. It contains quite a lot of inconsistent or imprecise descriptions, as also reflected in the above comments.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The experimental part of this paper is insufficient to illustrate the superiority of the proposed method in terms of experimental setup, baselines, and datasets. In addition, the novelty of this paper still needs to be improved.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

The reviewers maintain the previous decision unchanged. The main concern is the dataset in the experimental part. Although the authors argue that the dataset used is large, using a large dataset to validate the proposed model does not justify the generality of the model. Therefore, the reviewers still recommend that the authors add more benchmark datasets to better evaluate the proposed model.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
Summary & Contribution: This paper proposes an efficient vision model called High Resolution Convolutional Transformer (HCT), a high resolution transformer-based model that adopts the recently proposed Performer to reduce the computational cost by leveraging a linear self-attention approximation. The main motivation for this framework is therefore to reduce computational cost. HCT is evaluated on a mammography dataset and shows improvements over the equivalent CNN.

The main contribution of this paper is a framework that exploits transformers and high resolution images at a significantly lower cost.

Key strengths:
- The authors propose an efficient transformer-based model that is superior on a high-resolution dataset
- Evaluation on a large mammography dataset
- Good comparison study which includes GMIC, an established benchmark, and strong ablation study
Key weaknesses:
- The technical novelty and contribution of the work are not fully clear, as HCT seems to be a combination of GMIC and Performer/Nystromformer, and the difference between the AC block and the attention block in Performer is not clear.
- There is no evidence for some of the claims made in the paper (e.g., ‘identifying local structures only needs local information’, ‘a huge number of layers is needed’ (also, what is huge?), ‘HCT’s ERF is flexible and agile’, etc.)
- Literature review does not include highly relevant work
- Ideally the evaluation study should be extended to include more datasets or baselines to claim that the proposed approach is superior
- Statistical tests details missing
- Mathematical details missing, for instance in sec 2.2, x and y seem to not be defined
Evaluation & Justification: Reviewers agree that the use of an efficient transformer model to process high resolution mammograms is interesting and that the results suggest an improvement over previous models. However, there are concerns regarding the experimental results, as they are insufficient to claim the superiority of the proposed method. The real novelty of the method is also not fully clear. Conclusions may be vague and overclaimed: for instance, the experiments use a single dataset, and the clinical relevance of the reported difference and improvement is unclear. How can this support the conclusion that HCT is superior to a CNN?

If a rebuttal is submitted, please clarify the main novelty and contributions of this work. Please consider all comments and suggestions from all reviewers. Ideally, the comparison study should be extended and/or further discussed, and statistical tests details and mathematical formulations clarified and/or better defined.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

3

Author Feedback

We thank the reviewers for their time and feedback. [Dataset] Evaluation on a single dataset is indeed a limitation. Fortunately, our dataset is large enough to compute statistical significance. Our performance is above the 95% CI of the baseline. Also, we ablate our model to identify the source of improvement: (1) Tab. 1 uses 300 random images to show how our effective receptive field (ERF) is superior to the baseline model. Our ERF is less concentrated around the center and spreads dynamically without explicit supervision; (2) In the supp. material, we present multiple in-depth studies. Tab. A1 shows our model is superior on malignant tissues that require a large receptive field (e.g., architectural distortion). Fig. A3 stratifies our dataset by malignant finding size. In this figure, our model is superior on images with large finding-size. Please note that our supp. material follows MICCAI guidelines*, i.e., ‘[No] text materials beyond figure and table captions.’

[Reference] We cited key relevant work given MICCAI guidelines, i.e., ‘up to 2 pages of references.’ R5 mentioned two references, arXiv:2204.04667 and Reformer. arXiv:2204.04667 was published two months after the submission deadline. We should not be penalized for this. The Reformer paper is already cited [15]; Reformer’s limitation is mentioned on page 5.

[Novelty/Contribution] Most MICCAI literature resizes images to 512×512 to avoid the computational cost of high resolution images during training. We propose an attention-convolution (AC) building block to train on high resolution images (3328×2560) with batch size bz=32 on a single GPU. Fig. 5 shows a significant performance boost (+4% AUC) when training on high resolution images.
- The main difference between our AC block and a standard attention block like in Performer is the down-sampling capability (Fig. 1). This is essential when working with 2D/3D images, as pointed out in Sec. 2.1. Our AC block is not GMIC-specific. It is a building block such as BasicBlock and BottleNeck. These blocks were originally proposed for ResNet, but they are used in other custom models.
- The simplicity of our proposal is not a weakness, but a strength. This simplicity makes it easy to replicate our work, as pointed out by R2 and R3.
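
The cost argument behind linear attention can be illustrated with a small NumPy sketch. It is an illustration only, not the authors' AC block and not Performer's FAVOR+ random features: the positive feature map `elu(x) + 1` and the function names are assumptions chosen for brevity. The point it shows is that reordering the matrix products avoids ever building the N×N attention matrix while giving the same output.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a simple positive feature map (an assumption for this sketch,
    # not the random-feature map used by Performer).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(q, k, v):
    # Builds the full (N, N) attention matrix: memory grows with N^2.
    a = feature_map(q) @ feature_map(k).T
    return (a / a.sum(axis=1, keepdims=True)) @ v

def linear_attention(q, k, v):
    # Same result, but only (d, d_v) and (N,) intermediates: memory grows with N.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v                 # (d, d_v) summary of keys and values
    z = fq @ fk.sum(axis=0)       # (N,) row-wise normalisers
    return (fq @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
print(np.allclose(quadratic_attention(q, k, v), linear_attention(q, k, v)))
```

Both functions compute the same kernelized attention; only the order of multiplication differs, which is what makes the approach usable when N is the number of spatial positions in a large feature map.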

[Deep Architecture] Training a deep network (e.g., ResNet50) on high resolution images (3328×2560) is not feasible. On a 48GB GPU, we can only fit a batch size bz=2 for ResNet50. To achieve our model’s bz=32, we would need 16 GPUs. In addition, ResNet50’s theoretical RF is ≈ 480×480 ≪ 3328×2560. This is why “Deep is a Luxury We Don’t Have”.
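
The two numbers in this argument can be checked with a short calculation. The sketch below assumes the original ResNet50 layout (7×7 stride-2 stem, 3×3 stride-2 max pool, bottleneck stages of 3, 4, 6 and 3 blocks, with the stride placed on the first 1×1 convolution of stages 2 to 4); the helper names are hypothetical. Under these assumptions the theoretical receptive field comes out at 483 pixels per side, consistent with the ≈ 480 quoted above.

```python
import math

def receptive_field(layers):
    # Standard recurrence: each layer widens the field by (kernel - 1) * jump,
    # where jump is the product of all earlier strides.
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r

def resnet50_layers():
    # (kernel, stride) pairs along the main path; layout is an assumption
    # based on the original ResNet50 design.
    layers = [(7, 2), (3, 2)]  # stem convolution, then max pool
    for stage, blocks in enumerate([3, 4, 6, 3]):
        for block in range(blocks):
            stride = 2 if (stage > 0 and block == 0) else 1
            layers += [(1, stride), (3, 1), (1, 1)]  # bottleneck block
    return layers

trf = receptive_field(resnet50_layers())
gpus_needed = math.ceil(32 / 2)  # target batch size / batch size per 48GB GPU
print(trf, gpus_needed)
```

Variants that put the stride on the 3×3 convolution instead give a somewhat smaller value, so the figure should be read as approximate.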

[Misc]
- ‘a huge number of layers is needed’: The theoretical RF (TRF) of ResNet50 is ≈ 480. Its ERF is even lower according to Luo et al. [19]. [19] shows that the ERF/TRF ratio is ≈ 38% when the number of layers is 50 ([19] Fig. 2 right). Thus, to achieve an ERF closer to a 3328×2560 image with pure convolutions, we would need a huge number of layers, as the ERF grows with the square root of the number of layers. This is impractical as we can only fit a bz=2 with ResNet50.
- ‘HCT’s ERF is flexible and agile’: In a baseline model, the ERF follows a Gaussian distribution independent of the input. In contrast, our model’s ERF follows a one-sided heavy-tailed distribution, i.e., dependent on the input; hence more flexible and agile. We can change this to ‘dynamic’ if the meta reviewer prefers.
- Mathematical details: Sec. 2.2 states x, y ∈ R^d. Concretely, d is the number of channels (in conv features), as stated in the Technical Details (Sec. 3).
- Statistical details: The 95% CI of a model’s AUC is constructed via percentile bootstrapping with 10,000 bootstrap iterations.
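
A percentile-bootstrap confidence interval of the kind described in the statistical details can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the pairwise AUC computation, and the synthetic data are all assumptions.

```python
import numpy as np

def auc(labels, scores):
    # Probability that a random positive outscores a random negative
    # (ties count as one half); equivalent to the area under the ROC curve.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_auc_ci(labels, scores, n_boot=10_000, alpha=0.05, seed=0):
    # Resample cases with replacement, recompute the AUC each time,
    # and take the alpha/2 and 1 - alpha/2 percentiles.
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue  # resample lacks one class, so AUC is undefined
        stats.append(auc(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Synthetic example data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.standard_normal(200)
print(auc(labels, scores), bootstrap_auc_ci(labels, scores, n_boot=1000))
```

A claim such as "above the 95% CI of the baseline" then amounts to comparing one model's AUC against the upper bound returned for the other.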

*https://conferences.miccai.org/2022/en/PAPER-SUBMISSION-AND-REBUTTAL-GUIDELINES.html

We believe the MICCAI community would be interested in our submission. It tackles a problem that is both important and relevant, i.e., processing high resolution images.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The main contribution of this work is a framework that exploits transformers and high-resolution images at a significantly lower cost. Authors also propose the Attention-Convolutional (AC) block.

Key strengths:
- The authors propose an efficient transformer-based model that is superior on a high-resolution dataset
- Evaluation on a large mammography dataset
- Good comparison study which includes GMIC, an established benchmark, and strong ablation study
Key weaknesses:
- Ideally the evaluation study should be extended to include more datasets or baselines to better support the claims made by the authors.
Review comments & Scores: There are still some concerns that claims of the paper are based on experiments on a single dataset, as experiments do not demonstrate the generality of the model.

Rebuttal: The authors have also successfully addressed my main concerns. Although novelty may still be considered limited in some aspects, all reviewers agreed to accept this work after rebuttal because it is considered an interesting approach.

Evaluation & Justification: Although evaluation of the method would benefit from an extended analysis including more benchmark datasets to show the advantages of the proposed method, reviewers agree that an efficient transformer model to process high resolution images such as mammograms is interesting and may be of interest to the MICCAI community. The authors have shown that using their method with the full high-resolution images improves results (+4% AUC), and this could be applied to other areas.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

9

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The main paper strengths are: an efficient transformer model that works well on high-resolution mammograms, compelling experiments with a good comparison with GMIC and strong ablation study. The main weaknesses are: unclear novelty, unclear claims, incomplete literature review, limited number of datasets in the evaluation, missing statistical tests, missing mathematical details. The rebuttal justifies the limited datasets, partially justifies the incomplete literature review (missing review [1] by R5), partially clarifies novelty, clarifies claim issue, introduces statistical tests and maths details. I believe this is an interesting paper for the community and the rebuttal helped address the issues, so I recommend acceptance.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

3

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The novelty of this paper is unclear and the experiments seem to be not sufficient. I am also wondering why “Deep is a Luxury We Don’t Have”? The application scenario does not require real-time performance at all, and the question in the title isn’t really answered in the paper. Reproducibility is poor and not all claims are supported by experimental evidence. The rebuttal does not really answer the question about novelty and mainly focuses on technicalities regarding guidelines.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

NR

Deep is a Luxury We Don't Have