Authors

Haoxuan Che, Yuhan Cheng, Haibo Jin, Hao Chen

Abstract

Diabetic Retinopathy (DR) is a common complication of diabetes and a leading cause of blindness worldwide. Early and accurate grading of its severity is crucial for disease management. Although deep learning (DL) has shown great potential for automated DR grading, its real-world deployment is still challenging due to distribution shifts among source and target domains, known as the domain generalization (DG) problem. Existing works have mainly attributed the performance degradation to limited domain shifts caused by simple visual discrepancies, which cannot handle complex real-world scenarios. Instead, we present preliminary evidence suggesting the existence of three-fold generalization issues: visual and degradation style shifts, diagnostic pattern diversity, and data imbalance. To tackle these issues, we propose a novel unified framework named Generalizable Diabetic Retinopathy Grading Network (GDRNet). GDRNet consists of three vital components: fundus visual-artifact augmentation (FundusAug), dynamic hybrid-supervised loss (DahLoss), and domain-class-aware re-balancing (DCR). FundusAug generates realistic augmented images via visual transformation and image degradation, while DahLoss jointly leverages pixel-level consistency and image-level semantics to capture the diverse diagnostic patterns and build generalizable feature representations. Moreover, DCR mitigates the data imbalance from a domain-class view and avoids undesired over-emphasis on rare domain-class pairs. Finally, we design a publicly available benchmark for fair evaluations. Extensive comparison experiments against advanced methods and exhaustive ablation studies demonstrate the effectiveness and generalization ability of GDRNet.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_42

SharedIt: https://rdcu.be/dnwHn

Link to the code repository

https://github.com/chehx/DGDR

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

The authors of this paper deal with the problem of domain adaptation in diabetic retinopathy grading. They propose a unified framework, where the output of one component is the input to the second component, that takes into account visual style shifts, diagnostic pattern diversity, and category imbalance across diverse domains. These modifications were made possible by introducing new loss functions and weighting functions into a basic backbone model. Additionally, they propose a new experimental setting to test their approach against other works in the literature. The new experimental setting, the extreme single-domain generalization setting, is based on a train-on-one-domain, test-on-the-rest protocol, but it augments the testing set with two extra large datasets. The method was tested on 8 databases and demonstrates considerably better results than existing methods.
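The train-on-one-domain, test-on-the-rest protocol described above can be sketched as follows. This is only an illustration of the split logic: the dataset names are placeholders, and the 6 + 2 division of the 8 databases is an assumption inferred from the review text, not taken from the paper.

```python
from typing import Dict, List


def esdg_splits(core_domains: List[str],
                extra_test_domains: List[str]) -> List[Dict[str, object]]:
    """Build one split per core domain: train on it, test on all the others.

    The extra domains are never used for training; they are appended to
    every test set, mirroring the "two extra large datasets" in the review.
    """
    splits = []
    for source in core_domains:
        targets = [d for d in core_domains if d != source]
        targets += list(extra_test_domains)
        splits.append({"train": source, "test": targets})
    return splits


# Placeholder names (hypothetical), not the datasets used in the paper.
core = ["D1", "D2", "D3", "D4", "D5", "D6"]
extra = ["X1", "X2"]
splits = esdg_splits(core, extra)

for s in splits:
    print(s["train"], "->", s["test"])
```

Each split keeps the training domain out of its own test set, so every reported score measures performance on unseen domains only.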
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) Evaluation: Strong evaluation of the approach using extreme single-domain adaptation. Also, two extra datasets with a larger scale of data are used for evaluation. 2) Methodology: While existing works focus on the style differences between domains, the authors here further decompose the source of generalization error into three different components, and they unify these components in a single framework. To learn general discriminative features for the pathologies, they adopt existing work on contrastive loss from the literature to learn intra-class features in the same domain. Rebalancing, in turn, is domain-dependent and is done by weighting according to the representation of each class in the domain.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

1) The paper suffers from clarity issues in different sections. In the methodology, in the first step of the data augmentation, the images are downsampled and degraded by adding spots or creating holes; there might be cases where a lesion (e.g. a microaneurysm in the R1 stage) gets obstructed, and this might reduce an expert's confidence in the diagnosis. Additionally, the diagram in Fig. 2 is very confusing and contains too much information. The algorithmic components, a flow diagram, and the function of each component together with its effect are all squeezed into the same diagram, unsuccessfully. In the experiment section and in Tables 1 and 2, the authors provide neither the standard deviation of the metrics nor a statistical significance analysis of their results. The ablation study of the proposed components under the extreme single-domain adaptation setting is missing. Furthermore, on this point and according to my understanding, when there is only a single domain to learn features from, the domain-class-aware re-balancing weighting is eliminated, and you end up with a classical grading model.

2) The authors do not discuss the limitations of their method, and they do not give future research steps.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The training hyperparameters are given. All the datasets are public and their references are given. However, details on the software framework (Torch, TensorFlow, etc.) and its version were not given, and the code for the data augmentation module (FundusAug), which is crucial and can affect the whole pipeline, is not provided. Overall, the implemented ideas are relatively simple and could be re-implemented, but important details needed to reproduce the exact results of the method are missing.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

1) Please modify Figure 2 and simplify it. It is not clear whether the top part is the fundus augmentation or the input to the first model. Please add labels in the figure indicating which section corresponds to which part. It would be better to separate the diagram into individual, simpler diagrams, each representing a different section.

2) It is not clear what the soft weights mentioned in the domain-class-aware re-balancing (DCR) are.

3) How do the diagnostic patterns differ across the considered databases? For example, some databases follow the American Ophthalmological Society grading system and some the British one. Could the method address cases where the pattern is very different and some grades are merged, for example DR/No-DR classification, or Severe NPDR and PDR grades merged?

4) In the augmentation, could the halo simulation represent an eye with cataract?

5) The losses (Lsup, DahLoss) are missing from the description of Figure 2. Also, what does "class" refer to in the figure?

6) I understand that ResNet50 is a standard model, but it is not a recent one. How would the method perform with stronger architectures such as DenseNet, ResNeXt, deeper networks, or Transformer-like architectures? Would it not learn better representations?

7) In the results and Tables 1 and 3, what do the underlined scores denote?

8) Do you take into account the inter-eye correlation of diabetic retinopathy?

9) What is the inference time of the method compared with the other methods that were considered?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors refine existing work on domain shift and extend the methodology by introducing two additional learnable components that take into account differences in the diagnostic pattern and class imbalance between domains. They also extend the experimental setting for testing domain adaptation methods by proposing to train on a single domain and test on the rest. However, the paper needs more work to clarify the different aspects of the components and the results.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This paper presents a generalizable DL-based DR grading framework for unseen domain. The key contributions include data augmentation, dynamic hybrid-supervised loss and domain-class-aware rebalance. Extensive validation is performed showing superior results compared to SOTA.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is easy to follow and well organized.
- The dynamic hybrid-supervised loss is interesting, which contributes a lot to the domain adaptation.
- The ablation studies are extensive and show the value of each component.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- In Table 2, it seems that VT and L_dhl bring the most improvement compared to the baseline ERM model, while DCR does not seem to contribute much to the performance. It would be great to hear the authors' thoughts on this.
- There is no dedicated section to explain how to set the different parameters.
- It would also be interesting to learn how different demographics could affect the domain shift.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code is available. The implementation details are not clear.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

See the weakness above
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written with methodology clearly explained. The validation is extensive.
Reviewer confidence

Somewhat confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #1

Please describe the contribution of the paper

The authors propose a new methodology that includes three generalization issues to offer better results in applied problems, in this case diabetic retinopathy.

The manuscript is well written and the method seems to be interesting, demonstrating adequate results including ablation and comparative experiments, being validated in public datasets of reference.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The method is well designed, including different parts that improve its behaviour in the applied domain in which it is tested. The application to diabetic retinopathy is of great interest in ophthalmology. The experiments include many public reference datasets, which is very positive. The authors compare the obtained results with state-of-the-art works, leaving the proposal in a good position. The experiments include an ablation study to demonstrate the positive impact of each part. The experiments also include a study of generalization from a single domain.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Despite the positive impact and the good results, the proposal seems limited in novelty: it consists of a comprehensive data augmentation, an adjusted loss, and a direct re-weighting to balance the classes.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The manuscript includes sufficient details for the reproducibility of the work. Public datasets were analyzed in the experiments.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

My reservations concern the novelty of the three proposed components:

Is the first proposal simply a data augmentation, or is there something more to it?

Regarding the proposed loss, despite the good results, it is not clear to me how it achieves what is claimed, namely the balance between intra-class variation and inter-class variation in the learned feature representations. A better and clearer explanation of this should be provided, as it is the motivation of the new loss.

The domain-class-aware re-balancing seems to be just a weighting of the samples using class counts. Is that all?

The method offers good behaviour and adequate results, but it is quite simple in terms of novelty.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

My decision is motivated by the strong experimentation, the ablation study, the comparison with the state of the art, and the good results.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

All reviewers agreed on the decision of ‘accept’. Nevertheless, the authors should revise the paper based on the detailed comments (especially from Reviewer #2) for the final version.

Author Feedback

Dear Reviewers,

We express our sincere gratitude for your time and effort in reviewing our manuscript. Your insightful comments and constructive criticisms are greatly appreciated. We will address each issue in turn.

FundusAug’s pipeline, novelty, and impacts on diagnosis confidence (R2, R3): It employs a series of operations to simulate realistic augmented views, capturing potential domain gaps. The pipeline and parameters are detailed in Fig 2 in the appendix. Its novelty lies in its unique benefits: a) leveraging image degradation for image augmentation; b) serving as a parameter-free plug-and-play component for any fundus task; c) creating realistic augmented views that, in conjunction with L_dah, capture lesion-aware information. It aims to maintain semantic consistency while using L_dah to compare weakly augmented views with FundusAug-augmented views.

Ability of FundusAug to simulate cataract (R3): Preliminary observations suggest that by generating a large halo and blurring the images, it can create images resembling an eye with cataract. However, its primary function is to simulate realistic augmented views with unchanged semantics.
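As a rough illustration of the "large halo plus blur" degradation mentioned above, the sketch below brightens a grayscale image around a chosen centre and then applies a mean filter. It is a minimal stand-in under stated assumptions, not FundusAug itself: the function names, the Gaussian falloff, the box blur, and all parameter values are hypothetical choices made for this example.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def add_halo(img: Image, cx: float, cy: float,
             radius: float, strength: float) -> Image:
    """Brighten pixels with a Gaussian falloff around (cx, cy), clipped to 1."""
    out = []
    for y, row in enumerate(img):
        new_row = []
        for x, value in enumerate(row):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            glow = strength * math.exp(-d2 / (2.0 * radius ** 2))
            new_row.append(min(1.0, value + glow))
        out.append(new_row)
    return out


def box_blur(img: Image, k: int = 1) -> Image:
    """Mean filter over a (2k+1) x (2k+1) window, truncated at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - k), min(h, y + k + 1))
                      for i in range(max(0, x - k), min(w, x + k + 1))]
            out[y][x] = sum(window) / len(window)
    return out


# Toy 5x5 "image" with uniform intensity (illustrative values only).
flat = [[0.2] * 5 for _ in range(5)]
haloed = add_halo(flat, cx=2, cy=2, radius=1.5, strength=0.5)
degraded = box_blur(haloed, k=1)
```

A larger `radius` and `k` give a more diffuse, washed-out view. Whether such a view is clinically comparable to a cataract is a separate question that this sketch does not address.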

Differences in diagnostic patterns, handling of merged classes, and potential impact of demography (R3, R4): Divergent lesion types, degrees, and areas have been observed across multiple datasets. This is closely related to the small data problem, which results in datasets that do not cover all possible combinations of lesion appearances. Our method can help models in situations with merged classes, as it focuses on learning generalizable feature representations. As shown in Tables 1 and 3, a significant generalization gap among datasets from different countries was observed. While we cannot definitively attribute this to demography, it is an interesting issue that may inspire future work.

Motivation of DahLoss (R2): It addresses the challenge of diagnostic pattern diversity by preserving lesion information and increasing feature representation variations. It uses a contrastive loss to compare augmented views, helping the model learn generalizable representations while preserving lesion/pixel-level information. It also encourages learning representations with sufficient intra-class diversity to resist the influence of unseen domains.
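To make the idea of "a contrastive loss to compare augmented views" concrete, here is a small sketch of a generic supervised contrastive loss over feature vectors. It is not DahLoss: the dynamic weighting and pixel-level term described in the paper are omitted, and the function name, temperature value, and toy features are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(features: Sequence[Sequence[float]],
                labels: Sequence[int],
                temperature: float = 0.1) -> float:
    """Generic supervised contrastive loss.

    For each anchor, samples with the same label are positives and all other
    samples form the contrast set. Anchors without positives are skipped.
    """
    z = [_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy features: two tight clusters (illustrative values only).
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(feats, [0, 0, 1, 1])     # labels match the clusters
misaligned = supcon_loss(feats, [0, 1, 0, 1])  # labels cut across clusters
print(aligned, misaligned)
```

The loss is low when same-label features sit close together and high when they do not, which is the mechanism a contrastive term relies on to shape the feature space.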

Technical novelty and motivation of DCR (R2, R3, and R4): An extreme imbalance among domain-class pairs was observed, as shown in Fig. 2. This imbalance can cause information from minority samples to be overlooked. DCR’s novelty lies in its consideration of domain-class imbalance and the use of beta to avoid overemphasis on a certain minor domain-class pair. Table 2 shows that DCR contributes an improvement of 1 AUC point, which is significant. While DCR only marginally improves performance due to the functionality of L_dah, it still helps balance attention on different datasets, boosting overall performance and leveraging data properly.
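One plausible reading of domain-class-aware re-balancing with a softening parameter is sketched below: each sample is weighted by the inverse frequency of its (domain, class) pair raised to the power beta. The exact formula is an assumption made for this example and is not taken from the paper; `dcr_weights` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def dcr_weights(domains: Sequence[str], classes: Sequence[int],
                beta: float = 0.5) -> List[float]:
    """Per-sample weights proportional to count(domain, class) ** (-beta).

    beta = 0 gives uniform weights, beta = 1 gives full inverse-frequency
    weighting, and values in between soften the emphasis on rare pairs.
    Weights are rescaled so that their mean is 1.
    """
    pairs = list(zip(domains, classes))
    counts = Counter(pairs)
    raw = [counts[p] ** (-beta) for p in pairs]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


# Toy data: pair ("A", 0) is common, the other pairs are rare.
domains = ["A", "A", "A", "A", "B", "B"]
classes = [0, 0, 0, 1, 0, 1]

uniform = dcr_weights(domains, classes, beta=0.0)
soft = dcr_weights(domains, classes, beta=0.5)
hard = dcr_weights(domains, classes, beta=1.0)
print(soft)
```

With a single domain, the pair counts reduce to class counts, so this scheme collapses to ordinary class re-weighting. That matches the reviewer's observation and the authors' reply that DCR focuses only on the class in that case.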

Meaning of underlined scores, DCR in ESDG, inter-eye correlation, compatibility with advanced architectures, STD, significance analysis, ablation study on ESDG, different parameters study (R3): Underlined scores denote performance ranked as the second-best. DCR focuses only on the class if there is only one domain. We consider the single eye situation. Our framework is easily compatible with other architectures like transformers. These problems mentioned by reviewers will be explored in the future journal version, given the 8-page limitation.

Computational time against other methods (R3): Most compared methods use the same pipeline as ours, a backbone and a classifier at the inference time. Therefore, these methods should have similar inference times. However, since MixStyle, CABNet, and GREEN have specific modules, they may be slightly slower than ours.

Finally, we appreciate your valuable feedback and will work diligently to address the issues raised. We also commit to open-sourcing the code to ensure reproducibility. Thank you again for your time and effort.

Towards Generalizable Diabetic Retinopathy Grading in Unseen Domains