
Authors

Yihua Sun, Dawei Li, Seongho Kim, Ya Xing Wang, Jinyuan Wang, Tien Yin Wong, Hongen Liao, Su Jeong Song

Abstract

Retinal thickness map (RTM), generated from OCT volumes, provides a quantitative representation of the retina, which is then averaged into the ETDRS grid. The RTM and ETDRS grid are often used to diagnose and monitor retinal diseases that cause vision loss worldwide. However, OCT examinations are available only to a limited number of patients because they are costly and time-consuming. Fundus photography (FP) is a 2D imaging technique for the retina that captures the reflection of a flash of light. However, current research often focuses on 2D patterns in FP, while its capacity to carry thickness information is rarely explored. In this paper, we explore the capability of infrared fundus photography (IR-FP) and color fundus photography (C-FP) to provide accurate retinal thickness information. We propose a Multi-Modal Fundus photography enabled Retinal Thickness prediction network (M²FRT). We predict RTM from IR-FP to overcome the limitation of acquiring RTM with OCT, which boosts mass screening with a cost-effective and efficient solution. We first introduce C-FP to provide IR-FP with complementary thickness information for more precise RTM prediction. The misalignment of images from the two modalities is tackled by the Transformer-CNN hybrid design in M²FRT. Furthermore, we obtain the ETDRS grid prediction solely from C-FP using a lightweight decoder, which is optimized with the guidance of the RTM prediction task during the training phase. Our methodology utilizes the easily acquired C-FP, making it a valuable resource for providing retinal thickness quantification in clinical practice and telemedicine, thereby holding immense clinical significance.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_55

SharedIt: https://rdcu.be/dnwMa

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to predict retinal thickness and ETDRS grid from IR/color fundus images. The proposed network combines both CNN and transformer and shows improvements over baseline methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper considered the problem of retinal thickness prediction (RTM and ETDRS grid), which is of clinical importance. The problem is also challenging from computer vision/deep learning perspective.

    The proposed method does not require OCT that is costly and time-consuming. Instead, it takes IR and color fundus images, both of which are commonly used in clinics.

    The paper is easy to follow with good illustrations and visualizations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    RTM is estimated from both IR and color fundus images, whereas ETDRS is from color fundus only. The motivation for using both modalities for RTM and only one for ETDRS is not clear.

    The paper claims that the decoder for ETDRS prediction is “guided by the RTM prediction task during training”. However, there is no information from the RTM prediction added to the ETDRS prediction. I can see that the RTM prediction incorporates information from the ETDRS prediction via the concatenation operation, but not vice versa.

    One of the key challenges discussed in the paper is the misalignment of IR/color fundus images. There is no analysis on how robust the method is w.r.t. the misalignment of input images.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper includes sufficient implementation details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see my concerns in “the weakness of the paper”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While there are several issues in the paper, I think it has merits (important task, good presentation, and good results) that outweigh the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I appreciated the effort to address my concerns. Overall, I think the paper has merits that outweigh its weaknesses.



Review #2

  • Please describe the contribution of the paper

    The authors use a combined U-Net-like and Transformer structure to predict retinal thickness maps and ETDRS grid subfield thicknesses (ST) from color fundus photography (C-FP) and infrared FP (IR-FP). In addition, a decoder trained on combined C-FP and IR-FP is used to predict ETDRS ST solely from C-FP.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using a transformer network to align two modalities (C-FP and IR-FP)
    • Using encoder features of both models to predict grid-region measures.
    • The method is extensively evaluated, including result discussions and ablation studies. It shows the benefit of the complementary information in a multi-modal approach.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The dataset is not described well. This makes it difficult to relate the reported MAE to the dataset. For a dataset with treated DME and almost no abnormal retinal thickness, the prediction task is much easier and a lower MAE can be expected than for a dataset with highly diverse retinal thicknesses, where the reported MAE would be acceptable.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is well described. The dataset and source code are not available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Major

    • “ETDRS grid prediction” is unclear. You do not predict a grid (e.g., spatial locations) but the aggregated thickness for each ETDRS grid subfield. Define this term early in the introduction rather than in the experiments (Section 3.2). Consider using another term, as to me “ETDRS grid” mainly refers to dividing the retina into specific subfields, over which arbitrary structural and functional measurements can be aggregated.

    • Provide some descriptive measurements of the dataset to give a feel for the distribution of retinal thicknesses (e.g., mean ± SD thickness, histograms, etc.).

    Minor

    Whereas in general the paper is well written, an additional iteration and some restructuring may make it easier to understand. Some parts are redundant and can be removed. Also consider whether some paragraphs should be moved from the experiments to the method section, and distinguish more clearly between experimental setup, results, and discussion.

    Provide standard deviation for MAE measurements if there is enough space.

    From a clinical perspective, the central subfield thickness is the most relevant thickness biomarker in a DME setting. For the MICCAI paper there is probably not enough space, but for further work I suggest reporting the MAE for the central 1 mm and the central ETDRS subfield.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Using a transformer to align modalities and using CNN and transformer features for regression is a nice approach. The method was properly evaluated and the paper is well written.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper proposes a multi-modal fundus photography enabled retinal thickness prediction network, which is the first to predict the retinal thickness map from CFP images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The CFP modality is used for RTM prediction for the first time.
    2. A multi-modal fundus photography enabled retinal thickness prediction network was proposed.
    3. It may have clinical application value if the prediction is accurate and robust, because the RTM can be estimated from CFP images alone, without OCT.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The rationale of the proposed network is not clearly explained. For example, why can the Transformer architecture deal with the misalignment between CFP and IR-FP images?
    2. Details about the input images are missing. For example, what is the original size of the CFP and IR-FP images, and how is “The shape of IR-FP, C-FP, and RTM is 544x544” implemented?
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is not easy to reproduce this paper, because many details are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Provide a clear explanation of how the misalignment between CFP and IR-FP images is handled.
    2. Provide details about the input images, such as their size.
    3. The sentence “OCT exams are only available to limited patients as it is both costly and time-consuming” is not very accurate, because OCT exams are not time-consuming.
    4. In the experiments, the dataset is relatively small and only one disease (DME) was used.
    5. What automatic algorithm is employed to segment the membrane layers?
    6. Why was PSNR used to measure the performance?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The rationale of the proposed network and many details are unclear.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    A novel method that combines CNN and transformer networks to predict retinal thickness and the ETDRS grid from IR/color fundus images has been proposed. The method is easy to follow and has been extensively evaluated, showing improvements over baseline methods. The use of a transformer network to align two modalities is a novel approach that has not been explored before. The authors also showed the benefit of the complementary information in a multi-modal approach, which is an important finding with implications for other medical imaging applications. However, detailed analysis and descriptions of the dataset and networks are missing. I recommend that the authors provide more detail in their rebuttal.




Author Feedback

We sincerely thank the reviewers for their valuable comments and suggestions.

Q(R2, R4): Dataset details, and why only focus on one disease, DME? A: Our dataset consists of patients diagnosed with macular edema who underwent intravitreal injections. The average retinal thickness across the dataset is 275.92 µm (std. = 20.91 µm). We focused only on DME because, for diseases other than DME, predicting retinal thickness itself has relatively less clinical value. E.g., for age-related macular degeneration, we need to look for subtle changes in abnormal OCT features (subretinal fluid, pigment epithelial detachments) rather than retinal thickness. So, focusing on just DME for our study was reasonable.

Q(R4): Why can the Transformer encoder E_T deal with the misalignment of C-FP and IR-FP? A: The Transformer uses attention mechanisms to determine correlations between subfields (Sec. 2.1). Through end-to-end training, E_T enables the network to establish stronger correlations in matched areas using the learnable parameters within the attention. This capacity enables the network to tackle the misalignment between the two FPs and facilitates a lower loss during training. In contrast, a fully convolutional approach concatenates features strictly according to their spatial locations in the input images, even if the features are misaligned, which limits the performance of RTM prediction (Tables 1 & 2, Fig. 2). Moreover, the RTM requires pixel-wise correspondence to the IR-FP, so it is better to deploy E_T for C-FP and a CNN for IR-FP.
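
As a rough illustration of this point (a minimal PyTorch sketch, not the authors' exact M²FRT/E_T design), cross-modal attention lets every IR-FP feature location attend to all C-FP tokens, so matched content can receive high attention weight even when the two images are not spatially registered; the token and feature dimensions below are made up for the example:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy fusion block: IR-FP features query C-FP tokens via attention."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, irfp_tokens: torch.Tensor, cfp_tokens: torch.Tensor) -> torch.Tensor:
        # irfp_tokens: (B, N_ir, C) flattened CNN features of IR-FP (queries)
        # cfp_tokens:  (B, N_c,  C) Transformer tokens of C-FP (keys/values)
        fused, _ = self.attn(query=irfp_tokens, key=cfp_tokens, value=cfp_tokens)
        # The residual keeps the IR-FP spatial layout, which the RTM must follow.
        return self.norm(irfp_tokens + fused)

# Toy usage with a 34x34 feature grid per modality (sizes are illustrative only).
fusion = CrossModalFusion(dim=256)
ir = torch.randn(1, 34 * 34, 256)
cf = torch.randn(1, 34 * 34, 256)
out = fusion(ir, cf)  # (1, 1156, 256), still indexed by IR-FP locations
```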

Q(R1): Why can the ETDRS prediction be guided by the RTM prediction task during training? A: Following the answer above, the encoder E_T is guided to produce aligned features by the fine-grained RTM prediction task. The regional thickness predictions in the ETDRS grid, decoded from E_T, are then improved (Table 2).

Q(R1): The motivation for using both images for predicting RTM and only C-FP for the ETDRS grid. A: Doctors acquire C-FP upon patients’ arrival (Sec. 2). If RTM is deemed necessary for diagnosis, another device will capture IR-FP and conduct OCT scanning, where RTM is pixel-wise registered with IR-FP. So, researchers have aimed to predict a pixel-wise corresponding RTM from IR-FP (DeepRT [9]), and we newly incorporate the complementary information from multi-modal C-FP, which improves the RTM prediction accuracy. Direct prediction of RTM solely from unregistered C-FP is not straightforward, while the ETDRS grid is a regionally averaged thickness that does not require strict pixel-wise correspondence. So, we predict the ETDRS grid from C-FP. Predicting the ETDRS grid solely from C-FP without IR-FP also has clinical significance for telemedicine (Sec. 1), since C-FP can be easily taken with a smartphone.

Q(R4): Network details & input details. A: Detailed configurations of our network are in the supplementary material. E_T can be implemented with the 2D “UNETR” encoder in MONAI using the provided parameters. For IR-FP, we center-crop the area corresponding to the OCT scanning area at a resolution of 544x544 and then compute the RTM ground truth within it. For C-FP, we resize it to 544x544 from the original resolution of 3608x3608.
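
For concreteness, here is a hedged preprocessing sketch along these lines; the crop geometry for the OCT scanning area and the UNETR hyperparameters are illustrative assumptions (the real configuration is in the paper's supplementary material), and the 2D instantiation assumes a MONAI version whose UNETR accepts spatial_dims=2:

```python
import torch
from torchvision.transforms import functional as TF
from monai.networks.nets import UNETR

def prepare_inputs(irfp: torch.Tensor, cfp: torch.Tensor):
    """irfp: (C, H, W) infrared FP; cfp: (C, 3608, 3608) color FP."""
    irfp_544 = TF.center_crop(irfp, [544, 544])           # keep the OCT scanning area
    cfp_544 = TF.resize(cfp, [544, 544], antialias=True)  # downsample the whole C-FP
    return irfp_544, cfp_544

# A 2D ViT-based network whose encoder could play the role of E_T; all hyperparameters
# other than the 544x544 input size are MONAI defaults, not the paper's configuration.
e_t = UNETR(in_channels=3, out_channels=1, img_size=(544, 544), spatial_dims=2)
```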

Q(R4): Why are OCT exams time-consuming? A: The OCT scan itself may take only minutes, but in real scenarios, accounting for pupil dilation, the OCT exam, and confirmation with ophthalmologists can significantly prolong the process in busy clinics.

Q(R4): What’s the algorithm that segments the OCT layers? A: We use a Heidelberg OCT machine equipped with a built-in membrane layer segmentation method. We export segmentation results from the machine and exclude results with expert-confirmed errors.

Q(R2): MAE for the central subfield; std. of the results. A: The central 1mm subfield is G_1 as defined in Fig. 1(c1&c2), and its MAE is reported in Tables 1 & 2. We will provide the std. in the supplementary material.

Q(R2): ETDRS grid prediction does not predict spatial locations of the grid. A: We only predict quantitative thickness values within the pre-defined ETDRS grid subfields. We will define this term in the introduction.
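
To make the term concrete, the sketch below aggregates an RTM into ETDRS-style subfield means; the 6 mm scan width mapped to 544 pixels, the fovea-centred grid, and the quadrant angles (which ignore eye laterality) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def etdrs_subfield_means(rtm: np.ndarray, mm_per_px: float = 6.0 / 544) -> dict:
    """Average a retinal thickness map over ETDRS-style subfields."""
    h, w = rtm.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r_mm = np.hypot(yy - h / 2, xx - w / 2) * mm_per_px          # distance from centre
    ang = np.degrees(np.arctan2(yy - h / 2, xx - w / 2)) % 360   # angle in degrees

    # Central subfield (G_1): disc of 1 mm diameter around the fovea.
    means = {"G1_central_1mm": rtm[r_mm <= 0.5].mean()}
    # Inner ring: 1-3 mm diameter; outer ring: 3-6 mm diameter; four quadrants each.
    rings = {"inner": (0.5, 1.5), "outer": (1.5, 3.0)}
    quads = {"quad0": (315, 45), "quad1": (45, 135), "quad2": (135, 225), "quad3": (225, 315)}
    for ring_name, (r0, r1) in rings.items():
        ring = (r_mm > r0) & (r_mm <= r1)
        for quad_name, (a0, a1) in quads.items():
            sector = ((ang >= a0) | (ang < a1)) if a0 > a1 else ((ang >= a0) & (ang < a1))
            means[f"{ring_name}_{quad_name}"] = rtm[ring & sector].mean()
    return means

# Toy usage on a synthetic 544x544 thickness map (values in micrometres).
print(etdrs_subfield_means(np.full((544, 544), 275.0)))
```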




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Pros:

    • Topic: The paper proposes a method to predict retinal thickness and ETDRS grid from IR/color fundus images.
    • Experiments: The method is extensively evaluated, including result discussions and ablation studies. It shows the benefit of the complementary information in a multi-modal approach.
    • Style: The paper is easy to follow with good illustrations and visualizations.

    Cons:

    • Motivation: The reasonability of the proposed network is not clearly explained.
    • Clarity: The details about the dataset are not well described.

    After Rebuttal:

    - The authors failed to convince the reviewer who gave the low score, but to me, the clinical need and novelty are sufficient for a conference paper, and some missing details can be improved in the revision.
    + The two positive reviews are generally consistent in acknowledging the contribution of this work.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a method to predict retinal thickness and the ETDRS grid from multi-modal fundus photography. The proposed network combines both CNN and transformer and shows improvements over baseline methods. The main concerns raised by the reviewers are about the dataset and the misalignment between IR and color fundus images. After the rebuttal, the authors partially addressed those concerns.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a novel approach to predict retinal thickness and the Early Treatment Diabetic Retinopathy Study (ETDRS) grid from infrared (IR) and color fundus images. The approach adeptly blends convolutional neural network (CNN) and transformer architectures, offering a marked enhancement over the existing methods. The paper’s substantive contribution and originality are praiseworthy.

    That said, the reviewers did voice some concerns that require attention. Chief among these was a certain ambiguity regarding the motivation behind employing both IR and color fundus images for the estimation of retinal thickness map (RTM), and only color fundus images for ETDRS grid prediction. In response, the authors have clarified the clinical diagnosis process, detailing the types of images needed at various stages, and explaining how their methodology dovetails with this process. While this is satisfactory, it is suggested that the authors further clarify these details in the paper for the benefit of the reader.

    The reviewers also noted an absence of comprehensive information about the dataset, which complicates the interpretation of the reported mean absolute error (MAE). The authors have since furnished additional information, including details about the patient conditions and the data characteristics, which primarily focus on Diabetic Macular Edema (DME). Given the context, the specificity of the dataset seems justified; however, it is recommended that the authors integrate this information into the main paper to bolster transparency.

    In conclusion, the authors have provided satisfactory responses to the concerns raised by the reviewers. Consequently, I propose accepting the paper, subject to minor revisions. Specifically, the authors should enhance the clarity and detail in the paper’s main body concerning the points discussed during the review process, thereby significantly improving the paper’s overall quality and readability.


