
Authors

Marzieh Oghbaie, Teresa Araújo, Taha Emre, Ursula Schmidt-Erfurth, Hrvoje Bogunović

Abstract

The automatic classification of 3D medical data is memory-intensive, and variations in the number of slices between samples are common. Naïve solutions such as subsampling can address these problems, but at the cost of potentially eliminating relevant diagnostic information. Transformers have shown promising performance for sequential data analysis, but their application to long sequences is data-, computation-, and memory-demanding. In this paper, we propose an end-to-end Transformer-based framework that classifies volumetric data of variable length in an efficient fashion. In particular, by randomizing the input volume-wise resolution (#slices) during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the positional information accumulated in each positional embedding generalizes to the neighbouring slices, even for high-resolution volumes at test time. As a result, the model is more robust to variable volume length and amenable to different computational budgets. We evaluated the proposed approach on retinal OCT volume classification and achieved a 21.96% average improvement in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the volume-wise resolution of the input during training results in a more informative volume representation compared to training with a fixed number of slices per volume.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_35

SharedIt: https://rdcu.be/dnwJQ

Link to the code repository

https://github.com/marziehoghbaie/VLFAT

Link to the dataset(s)

https://zenodo.org/record/7105232#.ZArNp-zML60

https://people.duke.edu/~sf59/RPEDC_Ophth_2013_dataset.html


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors proposed a new variable-length classification framework for volumetric data. For feature extraction, they designed a slice feature extractor (SFE) to extract a representation of each slice, and a volume feature aggregator (VFA) to combine the slice representations into a volume-level representation. They used a linear interpolation strategy to reorganize the positional-embedding (PE) sequence to align variable-length data. The experimental results show a significant improvement compared to the baseline.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors tackle the important problem of unifying variable-length volumetric data for end-to-end classification.
    2. The experimental results showed an improvement compared to the baselines on three datasets, including two public datasets and one private dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The idea of aggregating slice-level representations by linearly interpolating the PE sequence is simple and not novel.
    2. The organization of the experimental results and analyses is relatively poor.
    3. Fairness statements are lacking for both the comparative methods and the ablation experiments. The number of parameters for each experimental configuration is not given in Table 1.
    4. Only a single comparative method is used, and it is not representative.
    5. The experimental results on OLIVES do not indicate good generalizability of the proposed method.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper appears reproducible according to the implementation details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. In addition to designing positional encodings, it seems to me that the variation of attention scores between tokens should also be of interest in the variable-length problem. This may explain the best accuracy being obtained when ignoring PE on the OLIVES dataset. The relationship between variable-length inputs in computer vision and variable-length sequences in natural language processing should be analyzed.
    2. The setup of the ablation experiments lacks rationale. ViT-base has a much larger number of parameters than ResNet-18; why not use a convolutional network of comparable size? Why did the experiment with ResNet18 as the extractor use the worst-performing AP and MP rather than the best-performing VLFAT?
    3. The contribution of each module to the final classification performance is unclear. The results in the tables do not lead to convincing conclusions and require additional analysis in conjunction with the proposed modules. The robustness analysis only presents the results of two experiments, which is not convincing enough.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The idea of aggregating slice-level representations by linearly interpolating the PE sequence is simple and not novel.
    2. The organization of the experimental results and analyses is relatively poor.
  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors have addressed part of my concerns, and I have changed my score to “weak accept”.



Review #3

  • Please describe the contribution of the paper

    The contributions of the paper include the proposed end-to-end Transformer-based framework, the enhancement of positional embeddings, and the improved classification accuracy in retinal OCT volume classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper introduces a Transformer-based end-to-end framework for 3D volume classification that addresses the challenges of variable volume resolutions, scalability, and resource efficiency.
    2. The proposed approach uses local-similarity-aware positional embeddings, a Feature Aggregator Transformer module, and a novel training strategy called Variable Length FAT (VLFAT) to process volumes with a variable number of slices at both training and test time.
    3. The proposed approach achieves state-of-the-art performance in retinal OCT volume classification on a private dataset with nine disease classes and competitive performance on a two-class public dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In Table 2, why does ViT/FAT (sinPE) achieve better results than the proposed method on CNV2? I suggest adding more explanation of this phenomenon to provide insights for this task.
    2. In Table 1, why does ViT/1DConv achieve exactly the same results as ViT/VLFAT (ours) on OLIVES?
    3. Figures 2 and 3 are stretched; please consider rescaling them for a clearer presentation.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this paper is promising. The architecture of the network, the implementation details, and the data for training and evaluation are described in detail.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness part and refine the experimental section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is technically sound, the organization of the paper is good, and the evaluation of the proposed method is sufficient. I recommend acceptance of this paper.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    A Transformer-based model was proposed for 3D volumetric OCT classification with a variable number of B-scans. The model showed superior performance to the compared methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses the challenging practical problem of classifying OCT volumes with a variable number of B-scan slices. The model showed superior performance to the compared methods. The paper overall is clear and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the proposed method is limited, although it is a good application of Transformers. Some details need to be included; please see the later comments. SOTA MIL methods could be included for a more informative comparison.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It seems the authors are not able to release their code. Considering the missing details, the paper in its current version might not be easily reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. During training, the authors randomly sample n slices. How do they ensure that the most informative slices, e.g., those containing lesions, are included so that the volume-wise label does not change? How was the volume input size set to 25x224x224 (is it by interpolation)?
    2. How is the balanced accuracy computed? Is it balanced over the different classes/diseases? How many samples are there for each class? Why are BAcc and AUC used here? Please consider adding other metrics, such as precision, sensitivity, and the confusion matrix, as well.
    3. It seems that Duke and 9C are both used for training. Why is it said that Duke is used for “pre-training”?
    4. For Tables 1 and 2, please report all numbers with the same precision. Please also add citations for the baseline models.
    5. MIL is an emerging topic in medical image analysis. Please consider adding more MIL methods for comparison, such as: a. Ilse, Maximilian, Jakub Tomczak, and Max Welling. “Attention-based deep multiple instance learning.” International Conference on Machine Learning. PMLR, 2018. b. Wang, Xi, et al. “UD-MIL: uncertainty-driven deep multiple instance learning for OCT image classification.” IEEE Journal of Biomedical and Health Informatics 24.12 (2020): 3431-3442. c. Hu, Ting, et al. “A multi-instance network with multiple views for classification of mammograms.” Neurocomputing 443 (2021): 320-328.
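For reference on point 2: balanced accuracy is conventionally the unweighted mean of per-class recall, so each class contributes equally regardless of its sample count. A minimal sketch (not the paper's implementation; function and variable names are illustrative):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall,
    so every class counts equally regardless of its sample count."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Recall for class c: fraction of true-c samples predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Example: class 0 recall = 2/2, class 1 recall = 1/2 -> BAcc = 0.75
print(balanced_accuracy([0, 0, 1, 1], [0, 0, 1, 0]))  # 0.75
```

Under class imbalance this differs from plain accuracy, which would be 3/4 here as well but diverges as soon as the class sizes differ.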
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper overall is clear, the results are interesting, and the problem addressed is indeed practical. However, the paper lacks a fair comparison with the literature, and the clarity could be improved.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Authors present a transformer-based model for 3D volumetric OCT classification with variable number of b-scans. Reviewers see some merit in the paper, and also raise concerns on the novelty and experimental results, especially comparison with SOTA methods and the method’s generalization ability. Authors should try to address these and other concerns in their rebuttal.




Author Feedback

We thank all the reviewers for appreciating the merits of our work as 1) the addressed problem is challenging and practical (R2-3-4), 2) the proposed method achieves SOTA performance in OCT volume classification (R2-3-4), 3) sufficient evaluation (R3), 4) clarity and organization (R2-3-4), 5) scalability and resource efficiency (R3). In the following, we address the main concerns raised.

Novelty (R2-4, AC) The novelty of our method comes from the combination of two aspects: 1) a late-fusion-based model for volume classification composed of a slice feature extractor (SFE) that extracts slice-level biomarkers and a volume feature aggregator (VFA) that integrates the slice-wise embeddings; 2) a variable-length feature aggregator Transformer (VLFAT) with enhanced learnable positional encodings (PEs) as the VFA, where PE interpolation allows handling an arbitrary number of slices per volume (volume resolution) at both training and inference time. To the best of our knowledge, this approach has not been applied to volume classification tasks before, and most current solutions are based on pooling methods, which ignore 3D context information[16, 17, 23]. We highlight that this straightforward yet effective approach yields a single model that processes a wide range of volume resolutions at both training and inference time, which generally would not be possible except with an ensemble of models of different scales. More importantly, VLFAT is an efficient solution under training time/memory constraints that necessitate drastic slice subsampling, and it is also robust against extreme PE interpolation at test time for high-resolution volumes. Importantly, our approach is model-agnostic and can be applied to any model with a Transformer backbone.
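The PE interpolation the rebuttal refers to can be illustrated with a short sketch: a positional-embedding table learned for one slice count is linearly resampled to another slice count along normalized slice positions. This is a minimal NumPy illustration of the general idea, not the authors' implementation; the function name and shapes are hypothetical.

```python
import numpy as np

def interpolate_positional_embeddings(pe, target_len):
    """Linearly resample a learned positional-embedding table
    (trained for pe.shape[0] slices) to target_len slices.

    pe: (n_train, dim) array of per-slice positional embeddings.
    Returns a (target_len, dim) array.
    """
    n_train, dim = pe.shape
    # Normalized slice positions for the source and target resolutions.
    src = np.linspace(0.0, 1.0, n_train)
    dst = np.linspace(0.0, 1.0, target_len)
    # Interpolate each embedding dimension independently.
    return np.stack([np.interp(dst, src, pe[:, d]) for d in range(dim)], axis=1)

# Example: a PE table trained with 25 slices, applied to a 49-slice volume.
pe_train = np.random.randn(25, 768)
pe_test = interpolate_positional_embeddings(pe_train, 49)
assert pe_test.shape == (49, 768)
# Linear interpolation preserves the first and last slice embeddings.
assert np.allclose(pe_test[0], pe_train[0]) and np.allclose(pe_test[-1], pe_train[-1])
```

Because intermediate embeddings are convex combinations of their trained neighbours, the scheme relies on the local similarity of adjacent slices, which is what randomizing the slice count during training is meant to reinforce.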

Comparison with SOTA (R2-4, AC) We specifically selected two versions of ViViT to serve as the closest and most relevant baselines, since the comparison to them can better manifest the role of separate feature extractors and late fusion. Specifically, FE ViViT is similar to our approach and follows late-fusion, while FSA ViViT is a slow-fusion approach and processes spatiotemporal patches as tokens. In addition, we compared to methods with non-learnable poolings as VFA (e.g. ViT+MP), which can be considered as MIL-based baselines and also represent viable alternatives to the proposed method.

Rationality in ablation studies (R2-4, AC) In our ablations, we aimed to study the roles of both the SFE and VFA modules. For the SFE, we verified whether the features extracted by ViT are stronger than those extracted by a standard CNN. In this set of experiments, we used pooling as the VFA, where the quality of the slice-wise features is more influential. We opted for ResNet18 as it is a standard in medical image analysis[26], and the results suggest that ViT extracts stronger features than CNNs. For the VFA, we compared pooling methods[27], Transformers, and Conv1D, and showed the necessity of a learnable VFA and the superiority of Transformers over Conv1D. By comparing different positional encodings, we highlighted the superior/competitive performance of the learnable PEs specifically learned by VLFAT. Regarding the SFE ablation, we agree with R2 that ViT-base and ResNet18 are not comparable in size. For completeness, we now provide results using ConvNeXt as the SFE with max pooling as the VFA, achieving a BAcc of 0.42 on the 9C dataset, which is better than ResNet18 (0.33) but still much lower than ours (0.77) (#parameters: ConvNeXt-base+MP: 87M; ViT+MP: 85M).

Generalization on OLIVES (R2-3-4, AC) OLIVES contains typical cases of DR and DME[15], which makes the classification task easier compared to the 9C dataset, which includes patients with different disease severity, and hence the models achieve competitive performances even with simpler VFAs (pooling and Conv1D). In addition, the competitive advantage of VLFAT, i.e. assessing volumes of different resolutions, is not exploited in this dataset since 99% of the cases have the same number of slices.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed a major part of the reviewers' concerns, especially around novelty and generalizability.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This is a novel study that focuses on classification tasks in volumetric data. I recommend acceptance based on the positive reviews and careful rebuttal.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After the rebuttal, all reviewers agree with the acceptance of the paper and concerns about novelty and experimental results are solved. The authors are encouraged to comply with other minor comments and prepare their final version.


