
Authors

Mohammad Atwany, Mohammad Yaqub

Abstract

Domain Generalization is a challenging problem in deep learning, especially in medical image analysis, because of the huge diversity between datasets. Existing papers in the literature tend to optimize performance on single target domains, without regard to model generalizability on other domains or distributions. A high discrepancy in the number of images and major domain shifts can therefore cause single-source-trained models to under-perform during testing. In this paper, we address the problem of domain generalization in Diabetic Retinopathy (DR) classification. The baseline for comparison is set as joint training on different datasets, followed by testing on each dataset individually. We therefore introduce a method that encourages seeking flatter minima during training while imposing a regularization term. This reduces gradient variance across different domains and therefore yields satisfactory results on out-of-domain DR classification. We show that adopting DR-appropriate augmentations enhances model performance and in-domain generalizability. By performing our evaluation on 4 open-source DR datasets, we show that the proposed domain generalization method outperforms separate and joint training strategies as well as well-established domain generalization methods.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_61

SharedIt: https://rdcu.be/cVRst

Link to the code repository

https://github.com/BioMedIA-MBZUAI/DRGen

Link to the dataset(s)

https://kaggle.com/c/diabetic-retinopathy-detection

https://kaggle.com/c/aptos2019-blindness-detection

https://www.ias-iss.org/ojs/IAS/article/view/1155


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper explores domain generalisation for the task of classifying the grade (0 to 4) of diabetic retinopathy from retinal fundus scans.

    The proposed method is to average model weights identified at particular iterations of training. An additional loss is added to reduce the covariance of the gradients across datasets.

    Four datasets are used, with a leave-one-dataset-out protocol for testing.
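    As an illustration of the setup described above, the following is a minimal PyTorch-style sketch, not the authors' code: iteration-wise weight averaging coupled with a warmup-gated penalty on the variance of per-domain gradients. All names and hyperparameter values (featurizer, classifier, grad_variance_penalty, LAMBDA, WARMUP) are illustrative assumptions, and the penalty here matches per-domain mean gradients rather than the per-sample gradient covariance used by Fishr.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.optim.swa_utils import AveragedModel

    # Toy model: a small featurizer plus a 5-way classifier head (DR grades 0-4).
    featurizer = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
    classifier = nn.Linear(128, 5)
    model = nn.Sequential(featurizer, classifier)
    averaged_model = AveragedModel(model)            # running average of the weights
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    LAMBDA, WARMUP = 1.0, 100                        # assumed values, not from the paper

    def grad_variance_penalty(per_domain_losses, params):
        # Penalize how far each domain's gradient of the classifier head deviates
        # from the mean gradient across domains (a simplified stand-in for Fishr).
        grads = []
        for loss in per_domain_losses:
            g = torch.autograd.grad(loss, params, create_graph=True)
            grads.append(torch.cat([p.reshape(-1) for p in g]))
        grads = torch.stack(grads)                   # [n_domains, n_params]
        return ((grads - grads.mean(dim=0, keepdim=True)) ** 2).mean()

    def training_step(minibatches, step):
        # minibatches: list of (x, y) batches, one per source domain.
        per_domain_losses = [F.cross_entropy(model(x), y) for x, y in minibatches]
        loss = torch.stack(per_domain_losses).mean()
        if step >= WARMUP:                           # regularizer activated only after warmup
            loss = loss + LAMBDA * grad_variance_penalty(
                per_domain_losses, list(classifier.parameters()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        averaged_model.update_parameters(model)      # iteration-wise weight averaging

    At test time, under this sketch, averaged_model would be evaluated on the held-out dataset (after refreshing batch-norm statistics with torch.optim.swa_utils.update_bn if the network uses batch normalization).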

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The experimental design is sound (leave one dataset out for testing), the method is explained well and the paper is clear. The hyperparameters are reasonably well adjusted.

    Honest results showing heterogeneous improvement across datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weaknesses; however, the results are a bit underwhelming. Performance on three datasets is clearly improved, whereas performance on the fourth is markedly decreased. As a result, the overall performance is only slightly above the baseline.

    This is noted in the discussion, but no explanation of why this is happening is provided. It would be worth investigating a few hypotheses.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Enough details are provided for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It would be great to at least speculate about why the performance is so uneven across datasets, and even better to conduct experiments to investigate this further. For example, the dataset with degraded performance is the smallest one (with no grade 4). It should be easy to check whether the relative size of a dataset or its class imbalance is linked to its performance improvement, or lack thereof.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting paper but analysis could have been more informative.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors address the problem of domain generalization in Diabetic Retinopathy (DR) classification. The baseline for comparison is set as joint training on different datasets, followed by testing on each dataset individually. The authors therefore introduce a method that encourages seeking flatter minima during training while imposing a regularization term.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is easy to implement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The motivation behind the proposal is unclear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The motivation to develop such a regularization term should be described in detail.
    2. The theoretical guarantee of the influence of the regularization term on the model training process should be presented.
    3. The comparison experiments are not enough to prove the superiority of the proposed method.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Theoretical analysis and experimental results are both insufficient.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors address the problem of domain generalization applied to retinopathy classification. The proposed method is built on Fishr regularization, and the generalization capability is shown using 4 datasets of different sizes. The averaged results show an improvement vs. SOTA of ~1%. The authors plan to share the GitHub repository containing the source code for reproducibility.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is clear and well structured. The description in the implementation details is helpful for understanding the precise settings of the experiments. The manuscript addresses a precise problem and evaluates the method with appropriate comparisons to the baselines and the closest method, Fishr. It is the first time that this method has been applied to the chosen datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I see 2 points as main weaknesses:

    1. The contribution of this work with respect to the method is not clear to me. It is clear that the existing method has not yet been applied to the chosen datasets, but what exactly is the difference vs. Fishr? Is it the use of stochastic weight averaging? If so, then this is not a new method, but rather a variation of Fishr.
    2. The standard deviation measure is misleading. In this case the standard deviation quantifies the difference in performance across the different testing datasets; however, it is not clear to me why a successful method should have less variability here, as the different datasets vary in distribution and size. A standard deviation should instead be attached to each result by re-running the experiments with different pseudo-random seeds, in order to quantify the improvement relative to run-to-run variability, rather than quantifying the variability of one experiment across different settings.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The implementation details, together with the source code used to run the experiments, which the authors plan to share, address reproducibility. Assuming that the source code will contain a README that explains how to run the code to reproduce the experimental results, reproducibility is fulfilled.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Table 2: it would be good to keep absolute numbers only in the total-images column and express the remaining columns as percentages. This way, comparing class distributions is easier for the reader.

    Page 5: the coefficient gamma is not precisely defined; a precise definition would help the reader avoid having to keep referring back to the related work that presents Fishr regularization.

    Page 5: “We adapted Fishr [24] loss to enforce invariance based on the difference in covariance matrices as represented in equation 2” — I find the same equation in the Fishr paper (Eq. 4); what has been adapted?

    Table 5: Tables 3 and 5 should use the same labels, i.e., instead of the column “Dataset Name”, use “Testing Dataset”.

    Table 4: as above, the column “Accuracy” is misleading; it should be “Average Accuracy”.

    Page 8: in the discussion, no possible explanation is given for the difference in performance between Fishr and the proposed method on Messidor. Why does the proposed method have the lowest accuracy on Messidor, when for Fishr it was the dataset with the largest gains?

    Page 8: in the discussion, more importance should be given to possible explanations for why the improvement vs. Fishr is measured. I can only find “This is can be attributed to seeking a flatter minima empirically”; in my opinion, this should be explained in more detail to make this work useful to the reader.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even if this work is not about a completely novel method in domain generalization, it shows an improvement over an existing method, and it is the first time the method has been applied to the chosen datasets. Therefore, I believe that this work is interesting for the community, provided the comments above are addressed in the manuscript.

  • Number of papers in your stack

    2

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    TBA

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

We thank the reviewers (R1-3) for their constructive feedback and comments, and we address the following concerns.

R2: Regarding the motivation behind the regularization and its influence. As we present in the introduction, part of the contribution is proposing a method that “utilizes flatness in domain generalization training for iteration-wise weight averaging, coupled with domain-level gradient variance regularization.” The motivation is therefore focused on utilizing the regularization term rather than developing it. Regarding its influence, it mainly focuses on reducing gradient variance across domains in the classifier’s weights. To this end, the model is initially allowed to train up to the warmup iteration before the regularization is activated. This enables the model to first focus on learning predictive features and then shift its focus towards gradient invariance to reduce the domain shift. We shall ensure that this is clear in the updated paper.

R2: Regarding the comparison experiments. As shown in Section 5, consistent with our contribution, we aim to introduce consistent learning of Diabetic Retinopathy classification under the DomainBed setting. Therefore, the results for joint training are leveraged as a baseline. Further ablations on stochastic weight averaging and/or Fishr alone aim to demonstrate the empirical benefit of coupling weight averaging with the regularization. We thus report out-of-domain accuracy for the four most common datasets in this area to ensure a sufficiently fair comparison.

R2, R3: Approach to the methodology/novelty. The scope of this work is not to develop an entirely new DG method but rather to investigate how existing SOTA methods can be applied to the challenging task of DR classification. However, to our knowledge, this is the first work that introduces the combined use of Fishr regularization and stochastic weight averaging to leverage their benefits for a more generalizable classifier. With this combination, we showed improved classifier generalizability compared to SOTA.

R3: Use of standard deviation when reporting average accuracies in the results section. We agree that a smaller standard deviation does not necessarily mean a better-performing model, but in this context it is used as a supporting point towards a more consistently performing algorithm. Also, since the objective is to observe out-of-distribution accuracy, the stable tradeoff between densely learning predictive features and gradient invariance can be observed through the model’s accuracy during inference on the different testing datasets. In this context, lower variability for the proposed method signifies better-achieved feature/gradient invariance.

R1, R3: Regarding linking the performance on each dataset individually to its respective size or class imbalance. The feature/gradient invariance balance was discussed in the previous point and establishes the reasoning for the Messidor testing results. This dataset 1) is the smallest, 2) has a large class imbalance, and 3) has a complete absence of Grade 4 DR, i.e., no class 4 (see Table 2). There is also a scarcity of Grade 4 images in the other datasets, with percentages of around 2%, 8% and 2% for the EyePACS, APTOS and Messidor-2 datasets, respectively. So when Fishr alone is trained on the three other datasets, it becomes biased towards Grades 0, 1, 2 and 3, of which Messidor is entirely comprised, and it therefore achieves a high average testing accuracy on Messidor.

Fishr alone in the setting above focuses on reducing the domain shift through gradient invariance, but not necessarily the class imbalance effect. Therefore, when dense weight averaging is applied, the model seeks to generalize better across all 4 datasets and mitigate class imbalance. This then reduces the average testing accuracy on Messidor, due to the absence of Grade 4. We will make these points clearer in the discussion.

Minor R3 comments will be addressed as well.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors address the problem of domain generalization applied to retinopathy classification. The experimental design is sound, the method is explained well, and the paper is clear. In addition, the hyperparameters are reasonably well adjusted. In the preliminary reviews, the reviewers hoped to see more theoretical analysis and discussion of the experimental results. In the rebuttal, the authors do a good job of answering these points and convincing the reviewers. Hence, I recommend accepting this submission.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clearly clarified the issues raised by the reviewers. The explanations in the rebuttal look reasonable and correct to me.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Comments nicely addressed in the rebuttal. Highly relevant topic for the community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    upper mid-field


