
Authors

Qian Zhou, Hua Zou, Haifeng Jiang, Yong Wang

Abstract

Cataract surgery is the primary treatment option for cataracts, and it is estimated that millions of cataract surgeries are performed each year globally. Predicting the Best Corrected Visual Acuity (BCVA) of cataract patients before surgery is crucial to avoid medical disputes. However, accurate prediction remains a challenge in clinical practice. Traditional methods based on patient characteristics and surgical parameters have limited accuracy and often underestimate postoperative visual acuity. In this paper, we propose a novel framework for predicting visual acuity after cataract surgery using masked self-attention. Unlike existing methods, which are based on monomodal data, our proposed method takes preoperative images and patient demographic data as input to leverage multimodal information. Furthermore, we extend our method to a more complex and challenging clinical scenario, i.e., incomplete multimodal data. First, we apply efficient Transformers to extract modality-specific features. Then, an attentional fusion network is used to fuse the multimodal information. To address the modality-missing problem, an attention mask mechanism is proposed to improve robustness. We evaluate our method on a collected dataset of 1960 patients who underwent cataract surgery and compare its performance with other state-of-the-art approaches. The results show that our proposed method outperforms the other methods and achieves a mean absolute error of 0.112 logMAR. The percentage of predictions with errors within ±0.10 logMAR is 94.3%. In addition, extensive experiments are conducted to investigate the effectiveness of each component in predicting visual acuity.
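
To make the masked-fusion idea concrete, below is a minimal PyTorch sketch of attention over per-modality feature tokens, with missing modalities masked out. It is an illustration only: the module name, tensor shapes, and pooling scheme are assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class MaskedAttentionFusion(nn.Module):
        """Fuse one feature token per modality; absent modalities are masked."""
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, tokens, present):
            # tokens:  (batch, num_modalities, dim), one token per modality
            # present: (batch, num_modalities) bool, True where data exists
            # key_padding_mask marks positions to IGNORE, hence the negation.
            out, _ = self.attn(tokens, tokens, tokens, key_padding_mask=~present)
            # Pool only over the modalities that are actually present.
            w = present.unsqueeze(-1).float()
            return (out * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)

    # Example: 3 modalities (e.g., OCT, ultrasound, SLO); sample 1 lacks one.
    feats = torch.randn(2, 3, 256)
    mask = torch.tensor([[True, False, True], [True, True, True]])
    fused = MaskedAttentionFusion(256)(feats, mask)  # shape: (2, 256)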

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_69

SharedIt: https://rdcu.be/dnwMt

Link to the code repository

https://github.com/liyiersan/MSA

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors take preoperative images and patient demographic data as input to leverage multimodal information for BCVA prediction. Furthermore, they extend the method to the case of incomplete multimodal data, and the results are not bad.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is a very meaningful study to use preoperative images and patient demographic data to predict postoperative BCVA after cataract surgery. Moreover, the article is well written, with clear organization and logic.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors do not provide the distribution of the BCVA gold standard over the samples used in the experiments, so it is difficult to judge the quality of the predictions from the mean absolute error of 0.112 logMAR alone.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method of the paper is reproducible, but because the data is private, the experimental results are difficult for others to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. In addition to MAE, it is recommended to calculate R-squared and relative errors such as MAPE and SMAPE (see the sketch after this list).
    2. In Figure 1(a), it is recommended to replace the label “B-Scan” with “ultrasound”. In addition, all abbreviations should be given in full at first use.
    3. What does the horizontal axis of Figure 1(b) represent?
    4. It is recommended to make it more obvious that the input of the fusion module in Fig. 2 consists of the features of all modalities.
    5. For regression tasks, the class activation map given in the supplementary material makes little sense.
    6. It is recommended to supplement the distributions of the predicted and ground-truth values, as well as a Bland-Altman plot of predicted vs. true values.
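
    As a concrete reading of items 1 and 6, here is a minimal Python sketch of the suggested metrics and plot. The arrays y_true and y_pred are hypothetical placeholders, not data from the paper.

        import numpy as np
        import matplotlib.pyplot as plt

        # Hypothetical ground-truth and predicted BCVA values (logMAR).
        y_true = np.array([0.10, 0.30, 0.00, 0.52, 0.22])
        y_pred = np.array([0.12, 0.25, 0.05, 0.48, 0.30])

        mae = np.mean(np.abs(y_pred - y_true))

        # SMAPE remains defined when individual targets are zero,
        # unlike MAPE, which divides by y_true.
        smape = 100 * np.mean(2 * np.abs(y_pred - y_true)
                              / (np.abs(y_true) + np.abs(y_pred) + 1e-8))

        # R-squared (coefficient of determination).
        ss_res = np.sum((y_true - y_pred) ** 2)
        ss_tot = np.sum((y_true - y_true.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        print(f"MAE={mae:.3f}  SMAPE={smape:.1f}%  R2={r2:.3f}")

        # Bland-Altman plot: mean of the two values vs. their difference.
        mean_vals = (y_true + y_pred) / 2
        diff_vals = y_pred - y_true
        md, sd = diff_vals.mean(), diff_vals.std()
        plt.scatter(mean_vals, diff_vals)
        plt.axhline(md, linestyle="--")
        plt.axhline(md + 1.96 * sd, linestyle=":")
        plt.axhline(md - 1.96 * sd, linestyle=":")
        plt.xlabel("Mean of prediction and ground truth (logMAR)")
        plt.ylabel("Prediction minus ground truth (logMAR)")
        plt.title("Bland-Altman plot")
        plt.show()
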
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the methods proposed by the authors have a certain novelty, such as the masked attention map, and the content of the article is complete. However, it is difficult to judge the quality of the prediction results because the target-value distribution is not given; the predictions might simply be close to the mean value.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper describes a multimodal approach for predicting visual acuity from multimodal inputs, including text data from electronic health records and image data from scanning laser ophthalmoscopy, optical coherence tomography, and B-scan ultrasound. The authors find that visual acuity prediction is improved by using multiple modalities over single modalities, and they use self-attention to cope with missing data elements. The loss is composed of a binary cross-entropy term (scaled by a scalar multiplicative hyperparameter) and the MSE between ground-truth and predicted visual acuity, as sketched below.
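
    A minimal sketch of such a combined objective, where the function name, the weight lam, and the tensor shapes are all assumptions rather than the paper's code:

        import torch
        import torch.nn.functional as F

        def combined_loss(pred_va, true_va, cls_logits, cls_labels, lam=0.5):
            """MSE regression term plus a scaled auxiliary BCE term."""
            reg = F.mse_loss(pred_va, true_va)          # visual-acuity regression
            aux = F.binary_cross_entropy_with_logits(   # auxiliary classification
                cls_logits, cls_labels.float())
            return reg + lam * aux                      # lam: scalar weight

        # Example with a batch of 4 samples.
        loss = combined_loss(torch.rand(4), torch.rand(4),
                             torch.randn(4), torch.randint(0, 2, (4,)))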

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper uses multiple clinically relevant modalities, including text data, ultrasound, scanning laser ophthalmoscopy, and optical coherence tomography data. The use of self-attention to handle missing data points also addresses the clinically valid challenge of incomplete data. The use of SOTA Transformer models to improve accuracy is a strength as well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper’s weaknesses include that the approach may not generalize to data modalities beyond OCT, ultrasound, and SLO. Furthermore, it is not clear whether the training/testing data are paired, so new OCT, ultrasound, and SLO data could cause accuracy and generalizability to suffer; discussing the impact of using paired versus unpaired data would be helpful. Lastly, since the loss function is rather simple, the authors might consider enhancing it with additional regularization terms, e.g., one per image modality (or some similar approach).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Paper is thorough in providing all attributes used for ease of reproduction of the work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Including the justification/motivation for the choice of these particular image modalities would help improve the paper’s quality. Also, explaining the specific impact of the self-attention, the text encoding, and each input modality on the final results would deepen the understanding of the results and help gauge performance on unseen/unpaired data.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper describes a multimodal approach for predicting visual acuity from multimodal inputs, including text data from electronic health records and image data from scanning laser ophthalmoscopy, optical coherence tomography, and B-scan ultrasound. The authors find that visual acuity prediction is improved by using multiple modalities over single modalities, and they use self-attention to cope with missing data elements. The loss is composed of a binary cross-entropy term (scaled by a scalar multiplicative hyperparameter) and the MSE between ground-truth and predicted visual acuity. Including the justification/motivation for the choice of these particular image modalities would improve the paper’s quality, and explaining the impact of the self-attention, the text encoding, and each input modality on the final results would deepen the understanding of the results and help gauge performance on unseen/unpaired data.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    Accurately predicting Best Corrected Visual Acuity (BCVA) remains a challenging task in clinical practice, particularly after cataract surgery. To address this issue, the paper employs a multimodal approach, taking into account modality-missing problems, to predict BCVA. The experimental results demonstrate promising outcomes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The pre-operative prediction of post-cataract-surgery visual acuity remains a challenge in clinical settings. The authors propose a multimodal approach to enrich the description of the preoperative patient status and improve predictive performance, and they also account for the common occurrence of missing modalities in clinical practice. 2) To adapt models pre-trained on natural images to the medical scenario, the authors devise an auxiliary classification task for fine-tuning. 3) The paper introduces masked self-attention, a simple yet effective approach for handling missing modalities.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The paper lacks sufficient detail on the dataset distribution. In clinical practice, preoperative visual acuity can significantly affect postoperative visual acuity prediction, with severe cataracts posing more difficulty than mild cases; subgroup analyses of the experimental results would help investigate this effect. 2) The influence of the different modalities, particularly SLO, on the results is not thoroughly discussed, which is critical for the selection of clinical examinations. 3) The paper does not present the results of CTT-Net (OCT) in Section 3.2, and instead compares CTT-Net (OCT+Text) with Wei (OCT); a single-variable comparison would facilitate better analysis.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim that the code will be released, but there is a lack of publicly available datasets for validation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1. Since there is ample space, it is recommended to provide more details about the composition of the “two images” and “one image” cases. Additionally, based on the chart, it appears that one-third of the samples have complete image modalities, rather than the stated quarter.
    2. On page 3, in the third line of the Image Encoder section, the phrase “However, it can load the…” could be rephrased as “The pre-trained weights can be utilized to expedite convergence.”
    3. There are typographical errors in Eq. 4 that require correction.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clinical significance and encouraging results

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors take preoperative images and patient demographic data as input to leverage multimodal information for BCVA prediction, and they extend the method to the case of incomplete multimodal data, with reasonable results. The significance of this study, the novelty of the proposed method, and the organization of the manuscript were recognized by all three reviewers. However, two reviewers noted deficiencies in the description of the dataset: one is the missing distribution of cataract severity, and the other is the missing distribution of postoperative BCVA. Both points are important for establishing that the results are valid. Therefore, I recommend that the authors supplement the detailed information of the dataset in the revision, possibly with a confusion matrix of the real values and predictions, and also address the other issues raised by the reviewers.




Author Feedback

Q1: Horizontal axis of Figure 1(b). (R1, R2)
A1: The horizontal axis of Figure 1(b) denotes the number of examination images available for one eye. For example, “three images” means that the eye has three different examination images (OCT, ultrasound, and SLO). Detailed descriptions will be added.

Q2: Dataset distribution. (R1, R2)
A2: Most preoperative visual acuities lie within 0.25~1.75 logMAR, while most postoperative visual acuities lie within -0.05~0.4 logMAR. The predicted visual acuity is largely consistent with the postoperative visual acuity. Histograms of the preoperative, postoperative, and predicted visual acuity distributions will be added.

Q3: More metrics. (R1)
A3: Since visual acuity may be zero, we do not compute MAPE. Here are some SMAPE results: Ours (OCT): 65.615±1.690, Ours (OCT+Text): 62.550±1.668, Ours (All): 57.165±1.610. Note that all values are percentages. More detailed results will be added.

Q4: The class activation map for regression tasks. (R1)
A4: Grad-CAM is computed from gradients, so it can also be applied to regression tasks; see, e.g., the MATLAB example at https://www.mathworks.com/help/deeplearning/ref/gradcam.html (a minimal sketch is given after this feedback).

Q5: The impact of different image modalities. (R2, R3)
A5: OCT clearly shows the morphological structure and lesions of the fovea. Ultrasound mainly shows the degree of opacity of the lens. SLO has a wider field of view and clearly shows the entire fundus. Since there are fewer SLO images (only 988) and Transformers are data-hungry, we do not compare the importance of SLO against the other modalities.

Q6: Subgroup analyses for mild/severe cataracts. (R2)
A6: Patients with severe cataracts usually have poor preoperative visual acuity. We find that the prediction accuracy drops by 1.5% when the preoperative visual acuity is higher than 1.0 logMAR. Details will be added.

Q7: The results of CTT-Net (OCT). (R2)
A7: We have conducted the experiments; CTT-Net (OCT) achieves MAE: 0.174±0.013 and Acc: 0.872±0.016. Our approach still outperforms CTT-Net by a large margin. The metrics of CTT-Net (OCT) are worse than those of CTT-Net (OCT+Text), showing the importance of preoperative visual acuity.

Q8: Generalizability. (R3)
A8: Ours is a multimodal framework that can easily be extended to other fundus images, such as multicolor/color images, thanks to the shared, powerful Transformer architecture of the image encoders. To take more image modalities as input, we only need to add more image encoders with slight modifications.

Q9: Paired/unpaired data. (R3)
A9: We use paired data to train and test all models in our experiments. The images of the different modalities for the same eye are not registered, which is why generative methods do not work. Using images from different eyes to predict a given eye’s visual acuity would not make sense.

Q10: Loss function. (R3)
A10: For simplicity and fairness, we use only MSE in the paper, consistent with other methods (e.g., CTT-Net and the ensemble model proposed by Wei et al.). We believe performance could be improved with a more elaborate regression loss; this is left for future extensions.

Q11: The impact of different modules. (R3)
A11: As shown in Table 2 of the paper, the auxiliary classification loss and the masked self-attention mechanism matter more than combining text. Specifically, the auxiliary classification loss enables the image encoders to learn disease-related features, and the masked self-attention mechanism avoids the negative impact of missing modalities on the available ones, which is of great importance for effective fusion. With added text prompts, the text encoder extracts semantic features much more easily. Overall, images provide more useful information than text, so the auxiliary classification loss is more effective than text combining. For learning with missing modalities, feature fusion is more important, so the masked self-attention mechanism contributes the most.
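
As a companion to A4, here is a minimal PyTorch sketch of Grad-CAM for a regression output: the scalar prediction itself is backpropagated instead of a class score. The tiny network is a hypothetical stand-in, not the paper's encoder.

    import torch
    import torch.nn as nn

    # Hypothetical stand-in CNN regressor that exposes its last feature map.
    class TinyRegressor(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

        def forward(self, x):
            fmap = self.features(x)
            return self.head(fmap), fmap

    model = TinyRegressor().eval()
    x = torch.randn(1, 1, 64, 64)
    pred, fmap = model(x)
    fmap.retain_grad()

    # For regression there is no class score: backpropagate the prediction.
    pred.squeeze().backward()

    # Grad-CAM: weight channels by their average gradient, sum, then ReLU.
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)        # (1, C, 1, 1)
    cam = torch.relu((weights * fmap).sum(dim=1)).squeeze(0)  # (H, W)
    cam = cam / cam.max().clamp(min=1e-8)                     # normalize to [0, 1]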


