
Authors

You Zhou, Gang Yang, Yang Zhou, Dayong Ding, Jianchun Zhao

Abstract

Early glaucoma can be diagnosed from morphological features using various imaging modalities. However, most existing automated solutions rely on a single modality, such as Color Fundus Photography (CFP), which lacks 3D structural information, or Optical Coherence Tomography (OCT), which suffers from insufficient specificity for glaucoma. To detect glaucoma effectively with CFP and OCT, we propose MM-RAF, a generic multi-modal Transformer-based framework for glaucoma recognition. Our framework is implemented with pure self-attention mechanisms and consists of three simple and effective modules: Bilateral Contrastive Alignment (BCA) aligns both modalities into the same semantic space to bridge the semantic gap; Multiple Instance Learning Representation (MILR) aggregates multiple OCT B-scans into a semantic structure and downsizes the OCT branch; Hierarchical Attention Fusion (HAF) enhances cross-modality interaction with spatial information. By incorporating these three modules, our framework can effectively handle cross-modality interaction between modalities with a huge disparity. Experimental results show that the framework outperforms existing multi-modal methods on this task and remains robust even on a small clinical dataset. Moreover, visualization shows that OCT can reveal subtle abnormalities in CFP, indicating that the relationship between the modalities is captured.
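
As a quick orientation for the architecture described above, the sketch below shows one way the three modules could compose, with each encoder stage set to a depth of 3 as stated in the author feedback. Everything in it (module internals, token shapes, the attention-MIL pooling used as a stand-in for MILR) is an illustrative assumption rather than the authors' released code, which is available from the repository linked below.

```python
import torch
import torch.nn as nn

def encoder(dim=768, depth=3, heads=8):
    """A plain Transformer encoder stage; depth 3 per stage, per the author feedback."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class MILPool(nn.Module):
    """Attention-based MIL pooling (Ilse et al., 2018), used here only as a
    stand-in for MILR: it aggregates many OCT B-scan tokens into one bag token."""
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.Tanh(),
                                   nn.Linear(dim // 4, 1))

    def forward(self, x):                        # x: (B, N, dim)
        w = torch.softmax(self.score(x), dim=1)  # per-instance attention weights
        return (w * x).sum(dim=1, keepdim=True)  # (B, 1, dim) bag-level token

class MMRAFSketch(nn.Module):
    def __init__(self, dim=768, num_classes=2):
        super().__init__()
        self.cfp_enc = encoder(dim)  # CFP branch; BCA would align its output with OCT
        self.oct_enc = encoder(dim)  # OCT branch, fed block-embedded volume tokens
        self.milr = MILPool(dim)     # downsizes the OCT branch to a single token
        self.haf = encoder(dim)      # fusion: self-attention over both modalities' tokens
        self.head = nn.Linear(dim, num_classes)

    def forward(self, cfp_tokens, oct_tokens):
        f_cfp = self.cfp_enc(cfp_tokens)             # (B, N_cfp, dim)
        f_oct = self.milr(self.oct_enc(oct_tokens))  # (B, 1, dim)
        fused = self.haf(torch.cat([f_cfp, f_oct], dim=1))
        return self.head(fused.mean(dim=1))          # glaucoma logits
```

In training, BCA would additionally act as a contrastive loss between the two branches' embeddings; a sketch of such a loss appears at the end of the Author Feedback section.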

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_66

SharedIt: https://rdcu.be/dnwMq

Link to the code repository

https://github.com/YouZhouRUC/MM-RAF

Link to the dataset(s)

https://ichallenges.grand-challenge.org/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents MM-RAF, a novel multi-modal framework for glaucoma recognition using color fundus photography (CFP) and optical coherence tomography (OCT) images. MM-RAF addresses the challenges of large discrepancies, unbalanced amounts, and lack of spatial information interaction between different modalities. It consists of three key modules - Bilateral Contrastive Alignment (BCA), Multiple Instance Learning Representation (MILR), and Hierarchical Attention Fusion (HAF) - that tackle these challenges. The experimental results show that MM-RAF outperforms other solutions in multi-modal glaucoma recognition and demonstrates the potential for improving glaucoma diagnosis in clinical practice.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Here are the main strengths of the paper:

    • The MM-RAF framework is a novel approach incorporating hierarchical attention fusion with transformers for multi-modal glaucoma recognition. It addresses the challenges in multi-modal glaucoma recognition, including large discrepancies, unbalanced amounts, and the lack of spatial information interaction between different modalities (CFP and OCT images). The proposed framework’s design, consisting of BCA, MILR, and HAF modules, enables effective intra- and inter-modal interactions, addressing the unique challenges in the domain.
    • The authors construct a new dataset by collecting 872 multi-modal cases from a state hospital's Department of Ophthalmology, in addition to using the public GAMMA dataset. This new dataset allows for a more comprehensive evaluation of the proposed MM-RAF framework and shows its effectiveness on both private and public datasets. Furthermore, using pseudo labels for CFP in the training phase demonstrates a practical approach to dealing with expensive human annotations in the medical domain.
    • The MM-RAF framework shows promising results in multi-modal glaucoma recognition, achieving state-of-the-art performance in F1, AP, and AUC metrics. The framework's robustness in learning inductive biases from images is also demonstrated, even with a limited dataset. This indicates the potential for clinical feasibility and real-world applicability of the proposed method.
    • The paper thoroughly evaluates the proposed framework, comparing it with single-modal and multi-modal solutions, as well as various baselines, including ResNet, ViT, DeiT, and Swin-Transformer. The evaluation also includes an ablation study examining the contribution of each module and the order of HAF, providing insights into the effectiveness of the proposed method.
    • The authors provide a visualization of the framework's mechanism, showcasing how the different modalities interact and how the framework converges on correct predictions as the network goes deeper. This visualization helps demonstrate the effectiveness of the proposed MM-RAF framework in extracting and combining modal-agnostic and modal-specific features for multi-modal decision-making.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The MM-RAF framework is specifically designed for multi-modal glaucoma recognition, and its applicability to other medical imaging tasks or other domains is not discussed. This limits the generalizability of the proposed method, and more studies are needed to understand its potential in other contexts.
    • The authors mention the high computational cost of the self-attention mechanism in the OCT branch, which might pose a challenge in real-world applications. While they propose block embedding as a solution, it is unclear whether this fully addresses the computational concerns or whether there are other potential optimizations to reduce the computational burden.
    • The paper does not include a comparison with lightweight transformer architectures, which could potentially provide similar performance with reduced computational complexity. A comparison with such architectures would provide a more comprehensive understanding of the trade-offs between performance and computational resources.
    • The paper does not provide a detailed discussion of the choice of hyperparameters, such as the depth of the encoders or the dimensions of self-attention. A sensitivity analysis or an exploration of how these choices affect performance would help establish the robustness of the proposed method.
    • The authors acknowledge the risk of overfitting when training the transformer-based method on the limited GAMMA dataset. Although they attempt to address this issue by pre-training on the private mid-scale dataset, the extent to which overfitting is mitigated remains unclear.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides sufficient details regarding the methods, dataset, and experimental setup, which contributes positively to the reproducibility of the study. However, it is not clear whether the authors have made the code publicly available, which could limit the ease of reproducing the results. In addition, the authors mention using standard pre-trained models and frameworks such as PyTorch and timm, which should facilitate replication if the code is shared. The specific dataset splits are provided in the supplementary material. Overall, based on the provided information, the study can be considered moderately reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, the paper is well-written, and the proposed MM-RAF framework demonstrates promising results in multi-modal glaucoma recognition. Here are some detailed comments for the authors to consider:

    • The paper is generally well-organized and clear in its presentation. However, some parts may benefit from further clarification, such as the explanation of the ablation study and the visualization. Consider providing more context and details in these sections to improve readability.

    • While the authors have collected a new dataset, it would be helpful to provide more information on the dataset’s accessibility for the research community. If the dataset cannot be publicly available, consider discussing the limitations and possible alternatives for researchers to reproduce the results.
    • The paper provides a comprehensive comparison with several existing methods. However, discussing how the proposed method differs from other transformer-based methods in multi-modal glaucoma recognition would be useful. This would help readers better understand the novelty and significance of the MM-RAF framework.
    • Although the paper briefly mentions the possible improvement with sufficient data and the use of lightweight transformers, a more in-depth discussion of the limitations and potential avenues for future research would be beneficial. This could include challenges in the model's scalability, generalizability, and any other factors that may affect its applicability in real-world clinical scenarios.
    • To improve the reproducibility of the paper, consider providing the code. This would greatly facilitate other researchers in reproducing and building upon your work.

    Addressing these points should strengthen the paper and make the contributions of the MM-RAF framework more apparent to readers. I also have some questions:
    • In the GAMMA dataset, the fundus images were acquired using a KOWA camera at a resolution of 2000 × 2992 pixels and a Topcon TRC-NW400 camera at a resolution of 1934 × 1956 pixels. Given that you use a Topcon Maestro-1, how did you ensure that the pseudo labels for CFP in your newly constructed dataset are accurate enough for effective training?
    • Regarding the ablation study, can you elaborate on the impact of the depth of the different modules on the performance of MM-RAF? What are the optimal depths for each module, and how were these determined?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave this paper a score of 6 due to the following factors:

    • The paper presents a novel framework, MM-RAF, which addresses the challenges in multi-modal glaucoma recognition. The three proposed modules (BCA, MILR, and HAF) effectively tackle the issues of the semantic gap, unbalanced amounts, and lack of spatial information interaction between the modalities.
    • The authors provide a thorough experimental evaluation, including comparisons with single-modal and multi-modal methods, as well as ablation studies to analyze the contributions of each module.
    • The paper demonstrates the clinical feasibility of the proposed framework by evaluating its performance on both a private dataset and the public GAMMA dataset.
    • The organization and clarity of the paper are good, which aids in understanding the proposed method and the experimental results.

    However, there are some moderate weaknesses mentioned above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The article proposes a multi-modal Transformer-based framework, MM-RAF, for effective glaucoma recognition using fusion of CFP and OCT modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The authors propose a multi-modal Transformer-based framework called MM-RAF for detecting glaucoma using Color Fundus Photography (CFP) and Optical Coherence Tomography (OCT).
    2. The framework includes three modules to handle cross-modality interaction and can effectively bridge the semantic gap between the two modalities.
    3. The experimental results show that the framework outperforms existing methods and is robust even with a small dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The amount of training data in datasets such as GAMMA is limited; it would be better to conduct five-fold cross-validation on the dataset for a more comprehensive evaluation.
    2. The authors claim that MM-MIL outperforms the proposed method on the glaucoma grading task in Table 2 due to the inductive bias introduced by the convolutional architecture. I suggest the authors replace the ViT backbone with a CNN backbone to verify this claim.
    3. In Table 3, why does discarding the BCA (3rd line of the table) achieve a better AUC than the proposed method that uses the BCA (last line of the table)? After all, alignment is vital for multi-modal tasks.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is promising: the details of the network, the loss function, and the dataset are well explained.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1. Since the authors claim this framework is generic for multi-modal glaucoma recognition, I wonder whether it has considered data types such as AS-OCT. It would be better to discuss glaucoma recognition based on AS-OCT [1,2] to make the framework more generic.
    2. Please refer to the weaknesses part.

    [1] Ferreira, Marcos Melo, et al. "Multilevel CNN for Angle Closure Glaucoma Detection Using AS-OCT Images." 2020 International Conference on Systems, Signals and Image Processing (IWSSIP). IEEE, 2020.
    [2] Yang, Yifan, et al. "Distinguishing Differences Matters: Focal Contrastive Network for Peripheral Anterior Synechiae Recognition." Medical Image Computing and Computer Assisted Intervention - MICCAI 2021, Part VIII. Springer, 2021.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is well-organized and proposes a method with three well-designed modules for glaucoma recognition. The evaluation results are promising; although some issues remain unaddressed, based on the strengths of the paper I recommend accepting it.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose MM-RAF, a multi-modal Transformer-based framework for glaucoma recognition using both CFP and OCT. The framework consists of three modules: Bilateral Contrastive Alignment (BCA), Multiple Instance Learning Representation (MILR), and Hierarchical Attention Fusion (HAF). These modules enable effective handling of cross-modality interaction between the two imaging modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Experimental results show that MM-RAF outperforms existing multi-modal methods and is robust even with small clinical datasets. Visualization of the relationship between CFP and OCT reveals subtle abnormalities, indicating the framework’s effectiveness in capturing the relationship between different modalities.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Qualitative visualization comparisons and analysis against other existing methods are missing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Easy to reproduce

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Add visualization comparison and analysis with other existing SOTA methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a good paper overall.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is reviewed by three experts in the field. The consensus is that the paper has merit, but there are some major issues. The main issues raised by reviewers are:

    1) Limited generalizability of the proposed method; more studies are needed to understand its potential in other contexts.

    2) Insufficient evaluation: the paper does not include a comparison with lightweight transformer architectures.

    3) To improve the reproducibility of the paper, consider providing the code. This will greatly facilitate other researchers in reproducing and building upon your work.

    4) Qualitative visualization comparisons and analysis against other existing methods are missing.

    Please make sure to integrate the points raised by all reviewers when preparing the final version.




Author Feedback

We thank all reviewers and the AC for their time, effort, and valuable comments. We will incorporate the suggestions into the camera-ready manuscript. Here are our point-by-point responses.

  1. Generalization of the Model [Meta-Review&R1&R2]:
    • Our method focuses on multi-modal recognition of glaucoma and other eye diseases using CFP (2D images) and OCT (3D volumes). It can be adjusted to support other scenarios, such as MRI+CT. R2 also mentions the use of AS-OCT (a 16-slice volume). The two cited papers, "Multilevel CNN" and "FC-Net," construct dual-stream networks by dividing AS-OCT into different chambers, and their designs can be adapted into our network with the task-related adjustments mentioned above.
  2. Comparison of the Lightweight Transformer Model [Meta-Review&R1]:
    • In our ablation study, we examined different depths for the modules and observed that shallower models tend to underfit. We will take these comments into account and include the EfficientTransformer in our work.
  3. Code/Data [Meta-Review&R1]:
    • We will provide the code and links to the public dataset. However, due to privacy concerns, we are unable to release the private dataset at this time. Hyperparameter selection will be disclosed in the supplementary materials.
  4. Qualitative Comparison in Cross-Modal Visualization [Meta-Review&R3]:
    • Our relevance-based method differs from traditional visualization methods in that it can produce cross-modal visualizations for attention models. We will include GradCAM maps for CNN models and relevance maps for the MBT model to enable qualitative comparisons across models in our future work.
  5. Determination of Model depths [R1]:
    • Since the baseline (ViT) has 12 layers, our network is designed with a total depth of 12, with each encoder set to a depth of 3, for a standard comparison. This design surpasses other models on the private dataset. Given this result and the space limitation, we did not extensively explore the depths of the individual modules.
  6. The role of the BCA module [R1&R2]:
    • Overfitting & domain shift [R1]: BCA effectively reduces overfitting on the GAMMA dataset and addresses domain shift. Re-alignment on small datasets helps mitigate overfitting and adapt to domain variations (device-to-device).
    • Why BCA brings an AUC decrease [R2]: Due to label imbalance, AUC is less informative than AP; however, the improvement in AP is marginal. This may be because the increased data volume saturates the modeling capacity, leading to plateaued performance even with only the representation and fusion modules. Moreover, classifying disc hemorrhage cases is challenging due to the limited number of such cases in our dataset. We plan to augment the dataset for further performance improvement. (A generic sketch of contrastive alignment appears after this list.)
  7. In-depth discussion on the potential avenues for future research [R1]:
    • Addressing confidence (uncertainty) measurement in a multi-modal setting is vital. It is crucial to prevent bias from any single modality from dominating the overall decision-making process, especially when diagnosing glaucoma with OCT, given its limited specificity. Cross-modal uncertainty measurement is a possible direction.
  8. Other:
    • Block Embedding [R1]: It is designed for the OCT branch. Instead of treating the entire volume as a multi-channel image or each slice as a single image, we divide the volume into blocks for patch embedding, which reduces computational cost and improves accuracy. We will provide a clearer explanation in the camera-ready version. (An illustrative sketch of block embedding appears after this list.)
    • Illustration of inductive bias in CNNs [R2]: While MM-RAF is a fully transformer-based network whose backbone cannot readily be replaced with a CNN, we acknowledge the importance of investigating the inductive bias of CNNs. We plan to replace the backbone of MM-MIL with a ViT backbone to re-evaluate the impact of the inductive bias.
    • 5-fold cross-validation [R2]: We recognize the significance of K-fold cross-validation and will include a comprehensive analysis.
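
As referenced in item 6 above, the paper defines BCA precisely; purely as an illustration, a generic symmetric (CLIP-style) contrastive objective for aligning paired CFP and OCT embeddings could look like the following. The function name, temperature, and shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bilateral_contrastive_loss(cfp_emb, oct_emb, temperature=0.07):
    """Generic symmetric contrastive alignment between paired CFP and OCT
    embeddings of shape (B, dim); a stand-in for BCA, not its exact form."""
    cfp = F.normalize(cfp_emb, dim=-1)
    oct_feats = F.normalize(oct_emb, dim=-1)
    logits = cfp @ oct_feats.t() / temperature              # (B, B) similarities
    targets = torch.arange(cfp.size(0), device=cfp.device)  # matched pairs on the diagonal
    # Symmetric in both directions: CFP -> OCT and OCT -> CFP.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```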
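
Likewise, for the block embedding in item 8, here is a minimal sketch of tokenizing an OCT volume with non-overlapping 3D blocks; the Conv3d tokenizer, block size, and dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch.nn as nn

class BlockEmbedding(nn.Module):
    """Illustrative 3D block embedding for an OCT volume: rather than treating
    the volume as a multi-channel image or each B-scan as a separate image,
    tokenize it with non-overlapping 3D blocks (sizes are assumptions)."""
    def __init__(self, block=(4, 16, 16), dim=768):
        super().__init__()
        # Each (4 x 16 x 16) block of the volume becomes one token.
        self.proj = nn.Conv3d(1, dim, kernel_size=block, stride=block)

    def forward(self, vol):             # vol: (B, 1, D, H, W), e.g. 256 B-scans
        tokens = self.proj(vol)         # (B, dim, D/4, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, N_blocks, dim)

# Token count shrinks by the block depth relative to per-slice 2D patching:
# a (256, 224, 224) volume yields 64*14*14 = 12,544 tokens instead of
# 256*14*14 = 50,176, cutting self-attention cost accordingly.
```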


