
Authors

Zhiyuan Cai, Li Lin, Huaqing He, Xiaoying Tang

Abstract

A large-scale labeled dataset is a key factor for the success of supervised deep learning in computer vision. However, having only a limited amount of annotated data is very common, especially in ophthalmic image analysis, since manual annotation is time-consuming and labor-intensive. Self-supervised learning (SSL) methods bring huge opportunities for better utilizing unlabeled data, as they do not need massive annotations. To use as many unlabeled ophthalmic images as possible, it is necessary to break the dimension barrier and simultaneously make use of both 2D and 3D images. In this paper, we propose a universal self-supervised Transformer framework, named Uni4Eye, to discover the inherent image property and capture domain-specific feature embeddings in ophthalmic images. Uni4Eye can serve as a global feature extractor, which builds its basis on a Masked Image Modeling task with a Vision Transformer (ViT) architecture. We employ a Unified Patch Embedding module to replace the original patch embedding module in ViT for jointly processing both 2D and 3D input images. Besides, we design a dual-branch multitask decoder module to simultaneously perform two reconstruction tasks on the input image and its gradient map, delivering discriminative representations for better convergence. We evaluate the performance of our pre-trained Uni4Eye encoder by fine-tuning it on six downstream ophthalmic image classification tasks. The superiority of Uni4Eye is successfully established through comparisons to other state-of-the-art SSL pre-training methods.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_9

SharedIt: https://rdcu.be/cVRYM

Link to the code repository

https://github.com/Davidczy/Uni4Eye

Link to the dataset(s)

https://ichallenges.grand-challenge.org/iChallenge-AMD/

https://ichallenges.grand-challenge.org/iChallenge-PM/

https://www.kaggle.com/c/diabetic-retinopathy-detection/data


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a self-supervised architecture that extracts latent representations for both 2D and 3D heterogeneous ophthalmic imaging data. These latent features can then be fine-tuned on different classification tasks. A first step extracts a fixed number of patches (2D or 3D) from the input images based on random masking of the images. These patches are then fed to a pretrained vision transformer block. Two decoder blocks acting on all patches (non-masked ones with visual attention, and masked ones) allow reconstructing the original images as well as the gradient images. Performance of this self-supervised latent-space learning is evaluated on six classification tasks based on a dataset concatenating more than 95,000 samples aggregated from different public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The general framework is designed to adapt to different types of ophthalmic data (fundus and OCT, 2D or 3D).
    • The implementation of the masked autoencoder is, as far as I know, original and efficient.
    • A comparison with state-of-the-art methods is performed.
    • The authors perform an ablation study to evaluate the impact of the different components of their architecture (mixing 2D and 3D data, use of two different decoders, etc.).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The dataset contains very heterogeneous data types. Although the authors perform different types of ablation studies, it is sometimes hard to disentangle the impact of the different patterns of the heterogeneous datasets; e.g., does performance on some type of images (e.g., fundus from GAMMA) benefit only from the latent representations learned on additional images of the same type (e.g., fundus from EyePACS), or also from additional images of different patterns (e.g., 3D OCT)? It might be interesting to add an ablation study training the SSL network on the different subtypes of images (similarly to what was done in Table 2 of the Appendix when considering separate training on 2D or 3D data). At least, the discussion should be enlarged to address this point.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors answered all questions positively, which does not exactly match the paper content, but they mention that they will make the code available, which is good!

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Short descriptions of the mmOphth-v1 ophthalmic dataset as well as the evaluation strategy (split into train, validation, and test sets) should be provided in the main paper.
    • Details concerning the backbone architectures of the encoder and decoder should be provided. For example, the authors refer to ViT-large and ViT-base, which is not clear for non-expert readers.
    • OCT and fundus images have very different texture patterns. It would be interesting to provide hypotheses on how the ViT module adapts to and benefits from such different data types.
    • 3D OCT images should be better described (e.g., OCT en-face?). Illustrations are provided in the Appendix, but there is no explanation of the differences between the various 2D and 3D image types. It would be nice to add some sentences on these differences and on how the different images are used in clinical practice.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper contains interesting ideas, experimentation is well conducted, the ablation study might be improved to disentangle the impact of the different image types.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose Uni4Eye, a self-supervised pre-training approach using Masked Image Modeling (MIM) and vision transformers to learn universal and relevant features of various 2D and 3D ophthalmic imaging modalities. The approach resides in proposing a Unified Patch Embedding module to handle both 2D and 3D data with MIM, and in multitask learning with the reconstruction of both the input image and its gradient map. The authors introduce and use mmOphth-v1, a large dataset of 2D and 3D ophthalmic images covering numerous modalities, which will be made publicly available. The relevance of the proposed pre-training is shown on multiple downstream classification tasks (4 in 2D, 2 in 3D) in comparison to state-of-the-art methods, as well as in ablation studies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Universality of features: the collected dataset includes numerous ophthalmic imaging modalities, and the pretrained model achieves state-of-the-art results on several classification tasks (proposed by different challenges) by fine-tuning, which could make their model a universal pretrained backbone for ophthalmic image feature generation and a variety of downstream tasks. New large public ophthalmic image dataset: the authors collected and created a dataset, to be made publicly available, containing a wide variety of ophthalmic imaging modalities as well as both 2D and 3D images (although some description of their contribution to the data collection is missing).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Slight lack of clarity in the evaluation: the authors denote the downstream tasks by the related challenge or dataset, which does not necessarily describe the actual classification task. The 3D results only report one comparison method (which is not state-of-the-art), whereas the 2D comparison methods could easily be extended to 3D. Moreover, each ablation-study result is reported for a single, different downstream task (certainly for time/resource considerations); the authors could explain their choices further. Although the reconstruction is a proxy task for the self-supervised pre-training of relevant features, its results (only qualitative) seem quite moderate in the visualization in Fig. 4, even with the lowest mask ratio (25%).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code will be made available. Still, the model and training hyperparameters are thoroughly described. I made a small comment for the authors on the ViT architecture details and the setting/tuning of the loss weights. The created/collected dataset is described in the appendix and will be made publicly available (although some description of the contribution to the data collection is missing).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Major comments:

    • Table 1 of the supplementary materials shows some details on the collected and created dataset mmOphth-v1. The table mostly shows that mmOphth-v1 is the concatenation of already existing public datasets (OCTA-500, GAMMA, EyePACS, PRIME-FP20, and Synthesized FFA). Some further description/details on the authors' contribution to the collection of the data would be welcome in the main paper.
    • To further investigate the relevance of the self-supervised pre-training and the generated features, the authors could show downstream-task results while freezing the pretrained model and only training a classification head (e.g., a linear layer or MLP) on top of it.
    • The authors decided not to describe (even briefly) the ViT architecture, yet two ViT sizes are used (-base and -large). Details on the important architecture hyperparameters could be added for both sizes.
    • The authors propose to only feed the visible patches to the ViT encoder. Could this hamper the spatial information between the patches? Is there any strategy to account for that? The authors could explain/discuss this point further.
    • For 2D inputs, the proposed UPE module outputs 2D square patches, whereas for 3D inputs it outputs 3D cubic patches. How does the ViT handle both possibilities?
    • As the training batch sizes are different for 2D and 3D, the authors could further explain their strategy to train over the whole 2D + 3D data sets. For example, maybe 1 training epoch = 1 epoch over the 2D data set + 1 epoch over the 3D data set?
    • Reconstruction results are assessed only qualitatively in Fig. 4; it would be interesting to see quantitative results (e.g., MAE, RMSE, SSIM, …) on the whole dataset and independently for the different modalities.

    Minor comments:

    • The authors could explain more about the use of the gradient map with the Sobel filters for the self-supervision task. The gradient map obtained with such filters appears as a rough segmentation of the image, and as the collected public datasets sometimes also seem to provide segmentation maps, those could be used alternatively, and the authors could compare this to their proposed approach.
    • The authors state that the loss weights were set equal to “make the network concentrate equally on global intensity information and local edge information”; further investigation could be performed on those hyperparameters.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The universality of the approach for ophthalmic images, using numerous imaging modalities as well as both 2D and 3D data, is very appealing, considering the outperforming results of the model with fine-tuning on several 2D and 3D downstream tasks, mainly for disease classification. Nevertheless, the limited methodological novelty of the straightforward self-supervision task using multi-reconstruction with MIM, which produces visually moderate results with no quantitative evaluation, appears as the main weakness.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Authors provided a rather clear and complete rebuttal to answer my comments, along with most of reviewers’ concerns. They justified most of their answers with concrete experiments, further demonstrating the relevance of their approach with numerous ablation studies.



Review #4

  • Please describe the contribution of the paper

    The paper proposes a self-supervised method named Uni4Eye that is based on a masked autoencoder, with a Unified Patch Embedding (UPE) module to enable the model to take both 2D and 3D images as input, and a two-branch decoder to enhance sharp edges in the reconstruction. The authors also collected what they claim to be the largest ophthalmic image dataset and evaluated six downstream tasks to show the superiority of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-organized.
    • The evaluation is complete and thorough.
    • The authors contributed what they claimed to be the largest ophthalmic image dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some parts of the method are not well-explained.
    • The novelty of the method is limited. The method is based on masked autoencoder (MAE). The novel components are Unified Patch Embedding (UPE), which does not make complete sense based on their current description, and the dual-branch decoder, which is not quite novel.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is provided. Not sure whether the dataset is open source.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Fig. 1 is a bit confusing. In the input (fundus, OCT en-face, and OCT), is there any relationship between the images, or is it just showing that the method can support any one of the three inputs?
    • Fig. 2 and Sec. 2.1 are also unclear to me. It seems that the 2D and 3D branches are unrelated, so can they basically be considered as two models? Also, it seems that the blue and orange vectors are combined to obtain f^d, but this seems to contradict the point that the 2D and 3D branches are unrelated. More explanation is needed.
    • Since the method is based on the masked autoencoder (MAE), a direct comparison with MAE seems to be a natural choice to show the contribution of the proposed components, but such a comparison is not included. Any reason?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method needs to be better explained.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The author feedback clarified the key component UPE. I have changed my rating to weak accept.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is reviewed by three experts in the field. The reviews of this work are quite divergent. The authors are invited to provide a rebuttal clarifying the main issues raised by the reviewers, including: 1) more details on the mmOphth-v1 dataset; 2) the discussion should be enlarged, and an ablation study on training the SSL network on different data subsets should be added; 3) more experimental details should be added, e.g., important architecture hyperparameters, feeding visible patches to the ViT encoder, and quantitative results on the whole dataset; 4) more details on the relationship between the 2D and 3D branches; 5) the novelty of the method is limited.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    108




Author Feedback

1. R1&R2: More details on mmOphth-v1 and downstream tasks. More details will be provided in the final version. This work is the first to have integrated such a large number of retinal image datasets of different modalities and dimensions. We conduct resizing, normalization, and FOV extraction to make mmOphth-v1 user-friendly. We are now collecting more datasets (including some private ones) and will release v2 in our journal extension.

2. R4: Novelty. Our innovations include: 1) We are the first to use MIM to learn general visual representations from both 2D and 3D ophthalmic images. 2) We create the largest ophthalmic image dataset of multiple modalities and different dimensions. We are now creating a larger v2 version and will release it in our journal extension. 3) We design a dual-decoder structure to better accommodate ophthalmic images.

3. R4: Comparison with MAE. We perform two supplementary experiments: 1) employing an MAE model pre-trained on ImageNet; 2) pre-training MAE from scratch with only 2D ophthalmic images. We get kappa values of 0.4699 vs. 0.6432 vs. 0.7228 for MAE (ImageNet) vs. MAE (ophthalmic) vs. Uni4Eye on Ichallenge-AMD.

4. R2&R4: More details on UPE. R2: UPE outputs a 4D tensor. We set the 4th dimension to 768 in both the Conv2d and Conv3d layers, so that the ViT encoder can handle both possibilities. R4: The blue and orange vectors in Fig. 2 are not combined to obtain f^d. Instead, f^d comes from either the 2D branch or the 3D branch. f^d is fed to the same encoder, and thus the two branches should not be viewed as two models. We will provide more details in Sec. 2.1 and modify Fig. 2 in the final version.
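As context for this clarification, here is a minimal PyTorch sketch of how a unified patch-embedding layer can project both 2D images and 3D volumes to a shared 768-dimensional token sequence for a single ViT encoder. The class name, patch sizes, channel counts, and input shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class UnifiedPatchEmbed(nn.Module):
    """Sketch of a unified patch embedding: 2D or 3D input -> (B, N, 768) tokens.
    Patch sizes and channel counts are assumptions for illustration."""
    def __init__(self, embed_dim=768, patch_2d=16, patch_3d=(4, 16, 16)):
        super().__init__()
        # Both projections share the same embedding dimension so a single
        # ViT encoder can consume tokens from either branch.
        self.proj_2d = nn.Conv2d(3, embed_dim, kernel_size=patch_2d, stride=patch_2d)
        self.proj_3d = nn.Conv3d(1, embed_dim, kernel_size=patch_3d, stride=patch_3d)

    def forward(self, x):
        if x.dim() == 4:      # (B, C, H, W) -> 2D branch
            tokens = self.proj_2d(x)
        elif x.dim() == 5:    # (B, C, D, H, W) -> 3D branch
            tokens = self.proj_3d(x)
        else:
            raise ValueError("Expected a 4D (2D image) or 5D (3D volume) tensor.")
        # Flatten the spatial dimensions into a token sequence: (B, 768, ...) -> (B, N, 768)
        return tokens.flatten(2).transpose(1, 2)

# Example: both calls yield (batch, num_tokens, 768) sequences for the same encoder.
upe = UnifiedPatchEmbed()
fundus = torch.randn(2, 3, 224, 224)        # 2D fundus image
oct_vol = torch.randn(2, 1, 64, 224, 224)   # 3D OCT volume
print(upe(fundus).shape, upe(oct_vol).shape)
```

Because both projections end in the same embedding dimension, the downstream encoder sees an identical (batch, tokens, 768) layout regardless of the input dimensionality.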

5. R4: Confusion on Fig. 1. There is no relationship among the input fundus, OCT en-face, OCT, etc. images. They do not need to come from the same subject or the same dataset; they are simply mixed for pre-training.

6. R2: Use of the gradient map. The Ichallenge-AMD and Ichallenge-PM datasets do not have segmentation maps, and it is more expensive to obtain segmentation maps than gradient maps.
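For reference, a gradient-map target can be produced without any annotation. The minimal sketch below builds a Sobel gradient-magnitude map with SciPy; this is one possible implementation and not necessarily the authors' exact code, and the normalisation choice is an assumption.

```python
import numpy as np
from scipy import ndimage

def sobel_gradient_map(image: np.ndarray) -> np.ndarray:
    """Gradient-magnitude target from a 2D grayscale image via Sobel filters."""
    gx = ndimage.sobel(image.astype(np.float64), axis=0)  # vertical gradient
    gy = ndimage.sobel(image.astype(np.float64), axis=1)  # horizontal gradient
    grad = np.hypot(gx, gy)
    return grad / (grad.max() + 1e-8)  # scale to [0, 1] (assumed normalisation)

# Usage: the gradient map serves as a label-free reconstruction target,
# unlike segmentation maps, which require manual annotation.
fundus_gray = np.random.rand(224, 224)
target = sobel_gradient_map(fundus_gray)
```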

7. R1&R2: More details on ViT-large and ViT-base. We will provide ViT details in the final version.

8. R2: Training strategy. We alternately train 1 epoch over the 2D data and 1 epoch over the 3D data.
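A minimal sketch of such an alternating schedule, assuming two separate data loaders (each with its own batch size) and a model whose forward pass returns the MIM loss; all names here are placeholders rather than the authors' code.

```python
# Hypothetical alternating schedule: one epoch on 2D data, then one on 3D data.
# `model`, `optimizer`, `loader_2d`, and `loader_3d` are assumed to exist elsewhere.
def train_alternating(model, optimizer, loader_2d, loader_3d, num_cycles):
    for cycle in range(num_cycles):
        for loader in (loader_2d, loader_3d):  # 2D epoch, then 3D epoch
            for images in loader:
                loss = model(images)           # assumed to return the MIM loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```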

9. R1: Impact of heterogeneous multimodal data types. We pre-train on EyePACS only, EyePACS+other fundus, EyePACS+other fundus+OCT en-face, and EyePACS+other fundus+3D OCT. On Ichallenge-AMD, their kappa values are respectively 0.6797, 0.6901, 0.7146, and 0.7192, while that of Uni4Eye is 0.7228.

10. R2: More ablation studies. We extend the compared 2D methods to 3D. The kappa values from Rotation Prediction, SimCLR, and SiT are respectively 0.7189, 0.7221, and 0.7246 on GAMMA (3D), while that of Uni4Eye is 0.7316. Because of the space limit, we will include ablation study results on other downstream tasks in our journal extension.

11. R2: Freezing the pre-trained model. In this setting, on Ichallenge-AMD, the kappa values from random initialization, ImageNet pre-training, SiT, ViT-base, and ViT-large are respectively 0.1619, 0.2704, 0.2994, 0.4238, and 0.4599.
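This evaluation protocol corresponds to linear probing: the pre-trained encoder is frozen and only a small classification head is trained on top. A minimal PyTorch sketch, under the assumption that the encoder returns a fixed-size feature vector per image; the helper name and usage are hypothetical.

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feature_dim: int, num_classes: int) -> nn.Module:
    """Freeze the pre-trained encoder and attach a trainable linear classification head.
    `feature_dim` is the encoder's output dimension (e.g. 768 for ViT-base)."""
    for param in encoder.parameters():
        param.requires_grad = False  # keep pre-trained weights fixed
    return nn.Sequential(encoder, nn.Linear(feature_dim, num_classes))

# Only the linear head's parameters are passed to the optimizer, e.g.:
# probe = build_linear_probe(pretrained_encoder, feature_dim=768, num_classes=2)
# optimizer = torch.optim.AdamW(probe[-1].parameters(), lr=1e-3)
```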

12. R2: Feeding not only the visible patches to the encoder. We feed both the visible and masked patches into the encoder, which induces a slight performance degradation (0.7192 vs. 0.7228 on Ichallenge-AMD) and more computational overhead.

13. R2: Moderate reconstruction results. For reconstruction, the average SSIM values are respectively 0.9441, 0.4680, 0.7712, 0.8843, 0.8874, and 0.9483 for fundus, gradient, OCT en-face, FFA, UWF FA, and UWF FP. However, good pretext-task performance does not necessarily guarantee good downstream-task performance [Xie et al., CVPR 2022]. We conduct pre-training solely on the Ichallenge-AMD dataset and obtain an average SSIM of 0.9644; however, this 0.9644-SSIM model achieves a kappa of 0.6643, while our 0.9483-SSIM model achieves 0.7228 on Ichallenge-AMD.
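For reference, per-modality SSIM numbers such as these can be obtained with scikit-image. The sketch below is an illustrative way to average SSIM over a set of reconstructions and is not taken from the authors' code; the data handling and value range are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(originals, reconstructions):
    """Average SSIM over pairs of 2D grayscale images (values assumed in [0, 1])."""
    scores = [
        structural_similarity(orig, recon, data_range=1.0)
        for orig, recon in zip(originals, reconstructions)
    ]
    return float(np.mean(scores))

# Usage: report one mean SSIM per modality (fundus, OCT en-face, FFA, ...), e.g.:
# fundus_ssim = mean_ssim(fundus_images, fundus_reconstructions)
```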

14. R2: Ablation studies on loss weights. For the 0.5:1 and 1:0.5 ratios, the kappa values on Ichallenge-AMD are 0.7164 and 0.7192, while the 1:1 ratio achieves 0.7228.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, all reviewers are satisfied with the response given by the authors, and are glad to see that the quality of the paper has been improved substantially.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have done a great job in the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors propose a self-supervised pre-training approach to learning universal features of 2D and 3D ophthalmic imaging modalities. The reviewers acknowledged strength in novelty. After the rebuttal, all reviewers agreed the concerns were addressed and recommended acceptance unanimously.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1


