
Authors

Ching-Hao Chiu, Hao-Wei Chung, Yu-Jen Chen, Yiyu Shi, Tsung-Yi Ho

Abstract

Fairness has become increasingly pivotal in medical image recognition. However, without mitigating bias, deploying unfair medical AI systems could harm the interests of underprivileged populations. In this paper, we observe that while features extracted from the deeper layers of neural networks generally offer higher accuracy, fairness conditions deteriorate as we extract features from deeper layers. This phenomenon motivates us to extend the concept of multi-exit frameworks. Unlike existing works mainly focusing on accuracy, our multi-exit framework is fairness-oriented; the internal classifiers are trained to be more accurate and fairer, with high extensibility to apply to most existing fairness-aware frameworks. During inference, any instance with high confidence from an internal classifier is allowed to exit early. Experimental results show that the proposed framework can improve the fairness condition over the state-of-the-art in two dermatological disease datasets.
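For readers unfamiliar with multi-exit inference, the following is a minimal sketch of the confidence-based early-exit policy described in the abstract, written in PyTorch-style Python. The module names (backbone_stages, internal_classifiers), the single-sample batching, and the fixed confidence threshold are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_exit_predict(backbone_stages, internal_classifiers, x, threshold=0.9):
    """Return the first internal classifier's prediction whose softmax
    confidence exceeds `threshold`; otherwise fall back to the deepest exit.
    Assumes a batch of size 1 and one classifier head per backbone stage."""
    feats = x
    logits = None
    for stage, clf in zip(backbone_stages, internal_classifiers):
        feats = stage(feats)                      # run the next block of the backbone
        logits = clf(feats)                       # internal classifier on these features
        conf, pred = F.softmax(logits, dim=1).max(dim=1)
        if conf.item() >= threshold:              # confident enough: exit early
            return pred, logits
    return logits.argmax(dim=1), logits           # no early exit: use the final classifier

The intuition behind the fairness gain is that confident instances exit at shallower, less group-specific features, so their predictions are made before the deeper, less fair representations are reached.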

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43898-1_10

SharedIt: https://rdcu.be/dnwAH

Link to the code repository

https://github.com/chiuhaohao/Fair-Multi-Exit-Framework/tree/master

Link to the dataset(s)

https://challenge.isic-archive.com/data/#2019

https://github.com/mattgroh/fitzpatrick17k


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework for fair diagnosis of skin lesion images based on a multi-exit (ME) strategy from deep neural networks. Building upon the hypothesis, supported by existing literature, that as we go deeper in a network, we obtain a higher classification accuracy in exchange for reduced fairness, this paper uses an early exit approach that allows an instance with high confidence of prediction to exit early. Quantitative results on 2 datasets and 2 network architectures show that the ME approach improves the fairness of classification while also improving the diagnosis performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper addresses an important problem in automated skin lesion diagnosis - fairness to sensitive attributes: gender and skin tone.

    2. The proposed method is interesting and is supported by existing literature on network overthinking and early exiting while making predictions.

    3. The authors provide a detailed comparison to multiple baselines, including a popular method, FairPrune, and show that their method improves the fairness of diagnosis with the added advantage of improved diagnosis performance itself.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. How is gender a sensitive attribute for ISIC 2019? Unlike Fitzpatrick-17k which contains clinical images, ISIC 2019 contains dermoscopic images, which are close-up images of lesions acquired using a dermatoscope with a very narrow field-of-view. These images would have very little information to suggest the lesion belongs to a particular gender. The authors provide no references to support their claim that gender is a source of bias for ISIC 2019. Since this work appears to mainly compare to FairPrune, please note FairPrune also doesn’t explain why gender is a sensitive attribute for ISIC 2019. In fact, several of my criticisms (#1, #5, #6, #8) are applicable to FairPrune too, as pointed out by Reviewer #2 (https://conferences.miccai.org/2022/papers/207-Paper2316.html) but were not addressed.

    2. The paper says SNNL “can serve as a proxy for analyzing the degree of fairness in a model”, and that high SNNL means the entangled features are indistinguishable, leading to fairer performance. This is not very clear, but almost the entire paper relies on this hypothesis (based on Fig. 1 in the Supplementary). (For reference, a standard formulation of SNNL over sensitive groups is sketched after this list.)

    3. What’s the loss function used for training? The authors only say “the conventional classification loss, l_t,” without specifying the exact loss.

    4. Sec. 3.1, X = x \in … and Y = y \in … is incorrect. X and Y can’t be same as x and y respectively. Instead, they should be sets comprised of input features x_i and target classes y_i. Similar error appears in the A = a \in … statement.

    5. The authors say they follow FairPrune for train, valid, test splits, which states that they “randomly split the dataset” in a 60:20:20 ratio. Was the split completely random? Why were the splits not stratified w.r.t. both target label and sensitive attribute? What were the proportions of target labels (disease classes) and sensitive attributes (gender, skin tone) in the train, valid, and test splits? Without this information, it’s difficult to assess if the models have been evaluated correctly.

    6. It is unclear why privileged groups were female and dark skin. Is it only because these groups achieved “higher accuracy by vanilla training”?

    7. In Sec. 5.2, when talking about the results of ME-ResNet18 in Table 3, the authors write “our framework achieved the same level of fairness as FairPrune in Table 1”. I do not see how. The (Eopp0, Eopp1, Eodd) values for ME-ResNet18 in Tab 3 are (0.006, 0.031, 0.016), whereas for FairPrune in Tab 1, they are (0.007, 0.026, 0.014), showing that ME-ResNet18 performs worse for 2 of the 3 metrics being reported.

    8. The authors mention twice that their method “can significantly improve” fairness, but have not performed any statistical significance tests.

    9. Sec. 5.1, “… the classification through shallower, fairer features further improves fairness”. There are no references to support that shallower features are fairer; the paper shows that classifiers trained on shallower features are fairer. These are two different things.
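    Regarding weakness 2 above: the SNNL is presumably the soft nearest neighbor loss. One standard formulation, with the class label replaced by the sensitive attribute as the authors later describe in their rebuttal (this is the generic definition, not necessarily the paper's exact variant), is

    \mathrm{SNNL}(\{x_i\}, \{a_i\}; T) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\sum_{j \neq i,\, a_j = a_i} \exp(-\|x_i - x_j\|^2 / T)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / T)}

    where x_i are the features at a given layer, a_i the sensitive attribute, and T a temperature. When groups are entangled, the same-group neighborhood mass in the numerator is no larger than chance and the loss is high; when groups are separable, the ratio approaches one and the loss approaches zero, which matches the reading that high SNNL corresponds to less group-distinguishable, fairer features.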

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of this paper is quite low. There are several key details that are missing (see weaknesses): the loss function, the number of samples in and the class-wise distribution of training, validation, and testing sets, the number of training runs and if the experiments were repeated.

    In the reproducibility checklist, the authors have replied “Yes” to several items that are not present in the paper:

    1. “A clear declaration of what software framework and version you used.” -> missing.
    2. “Information on sensitivity regarding parameter changes.” -> missing.
    3. “The exact number of training and evaluation runs.” -> missing.
    4. “The details of train / validation / test splits.” -> missing.
    5. “A description of results with central tendency (e.g. mean) & variation (e.g. error bars).” -> missing.
    6. “An analysis of statistical significance of reported differences in performance between methods.” -> missing.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Aside from addressing the points mentioned in the weaknesses, please consider addressing the following:

    1. For Fig. 2 (b), please consider labeling the bars with the respective values to allow for a quantitative comparison of the early exit policy with CLF_f, CLF_4, and CLF_3.

    2. There is a typo in Fig. 1, it should be “Fairness Constraint” instead of “Fairness Constrain”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper addresses an important problem and the results show that the proposed multi-exit method helps alleviate the fairness problem, there are several missing details, lack of clarity about the experiments, and insufficient support for some of the claims.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    After having read the author feedback, I think the authors have answered several of my questions, and I believe that the changes they have agreed to make will improve the paper.

    However, the authors write in the rebuttal that they used stratified splits for data partitioning, but details about the splits are missing. In the rebuttal, they write that their method is “evaluated using publicly available datasets with the settings of existing work [21]”. Wu et al. [21] do not provide this detail in their paper either, and neither further implementation details nor their code are available.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a multi-exit training framework with the goal of improving fairness in terms of selected sensitive attributes. The multi-exit framework operates by utilising features from the neural network’s early layers, which, while less discriminative, are less biased to sensitive attributes. The authors test their method on two dermatology disease diagnosis datasets, ISIC 2019 (with gender as the sensitive attribute) and Fitzpatrick-17k (skin tone as the sensitive attribute). To evaluate their experiments, the paper employs class equalised opportunity and equalised odds to measure fairness. The results show that using the multi-exit framework in conjunction with other fairness methods such as HSIC and MFD performed better in terms of fairness than using the method without the multi-exit framework. The combination of the multi-exit framework and FairPrune produced the best results.
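    For reference, one common formulation of these metrics for a binary sensitive attribute a and a given target class is shown below; the paper's exact per-class averaging convention may differ, so this is an assumption based on standard usage.

    \mathrm{Eopp1} = \left| P(\hat{y}=1 \mid y=1, a=0) - P(\hat{y}=1 \mid y=1, a=1) \right|   % true-positive-rate gap
    \mathrm{Eopp0} = \left| P(\hat{y}=0 \mid y=0, a=0) - P(\hat{y}=0 \mid y=0, a=1) \right|   % true-negative-rate gap
    \mathrm{Eodd} = \tfrac{1}{2} \left( |\Delta\mathrm{TPR}| + |\Delta\mathrm{FPR}| \right)    % equalized-odds gap (one common convention)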

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The concept of using a multi-exit neural network architecture for bias reduction and fairness improvement is novel, and the method performs well, especially when combined with other methods.

    The paper investigates how the features at different points in a model’s architecture affect the bias of the model’s predictions and shows experimental results for this in the supplementary materials.

    The multi-exit framework is clearly presented and easy to follow, and it is easy to see how the framework can be applied to other architectures and used in conjunction with other methods.

    The paper uses multiple measures of fairness to evaluate the method along with other standard classification metrics in their experiment results so readers can see how different methods impact the fairness and general performance of the predictions.

    An ablation study is presented and shows the results of additional experiments with the method, such as the effect of different confidence thresholds and early exits; this gives a deeper understanding of the multi-exit framework.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper mentions that the multi-exit framework can be used in a multi-class sensitive attribute setting (such as Fitzpatrick skin types), the experiments only use the binary setting. It would have been more interesting to see how this would perform, as such settings are very common in clinical applications.

    Furthermore, it would have been interesting to see how the multi-exit framework would work with multiple sensitive attributes at the same time, since when used in a clinical setting, a model must be fair with respect to multiple sensitive attributes simultaneously.

    There could be more discussion about the impact of the multi-exit framework on classification performance, as the results appear to be highly variable in terms of how the multi-exit framework affects the F1 score.

    The experiments could have been repeated with different data splits and model initialization parameters. Averaging the results of these runs could aid in determining the model’s stability.

    It would also be beneficial to see more complex neural network architectures used in the experiments to see how this model interacts with them as they become more prevalent in the state of the art.

    Comparisons of the time required to train and infer the model with and without the multi-exit framework would be helpful in determining the computational resources required.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors are conducting a computational study with datasets from the public domain so all experiments should be fully reproducible. The authors claim that all of the information required to replicate the experiments is available. It would be helpful to include a link to where this can be found in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The effect the multi-exit framework has on the discriminative performance for the diseases, and the trade-off with fairness, should be discussed more in the paper, as the results are present in the paper but not discussed.

    The experiments could be repeated multiple times and the results averaged (with standard deviation) to show consistency; the current results show only a slight improvement, and repeating the experiments would show whether the multi-exit framework’s improvement is consistent.

    More could be done to address the computational cost of using the model, in terms of both training and inference, to show the added overhead that comes with the multi-exit framework.

    For future work, experiments in other medical image analysis settings could be done, as this method could be applied to any type of medical imaging, along with extending the sensitive attribute from binary to multi-class (as mentioned previously, Fitzpatrick skin types).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method has some novelty, but it shows only a slight improvement over other state-of-the-art methods, and the experiments presented are limited, so it is unclear whether the method would show a consistent improvement in all settings. However, the method is novel in its application to bias reduction, and the paper is well written and clearly explained.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    While I appreciate the detailed replies to R1, I don’t think the comments on the experimental setup were suitably answered. While I think this method has merit, I think this paper is partly let down by its experiments and evaluation.



Review #3

  • Please describe the contribution of the paper

    The authors propose a novel, general approach for mitigating performance differences between patient groups: the use of a (previously proposed) multi-exit framework together with a loss function that includes a fairness regularization term. In a nutshell, the multi-exit framework allows the model to “exit” the inference process at one of the earlier layers if confidence is already high enough. The underlying hypothesis is that later layers contain more group-specific information, thus potentially leading to larger disparities. Exiting early may help alleviate this problem, while still allowing for necessary group adjustments. The proposed framework is evaluated on two skin lesion classification datasets (ISIC2019 and Fitzpatrick-17k), where it performs favorably both in terms of overall model performance as well as in terms of performance disparities between groups.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed unfairness mitigation approach is novel and interesting. Its merits include the facts that 1) it is quite broadly applicable to different model architectures, 2) it prevents unnecessarily group-specific predictions while still allowing for learning important differences between groups, and 3) it does not require access to the protected attributes during inference time and, in one version, not even during training time.

    The evaluation is reasonably comprehensive and includes a comparison to a number of relevant baseline methods. The manuscript also includes an ablation study that helps shed light on the efficacy of different components of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My main concerns are related to the quantitative evaluation. Here, I have a number of separate concerns.

    • All of the considered metrics depend on the selected decision threshold, and so a very important baseline would be a simple threshold optimization method applied to the vanilla baseline model, akin to the suggestions in Hardt et al. (2016). Also, how was the decision threshold selected in all the experiments described in the paper?
    • In addition, I would suggest the addition of a threshold-independent evaluation metric, such as AUROC or the area under the precision-recall-gain curve (cf. Flach and Kull, Precision-Recall-Gain Curves: PR Analysis Done Right).
    • It would seem very important to add some uncertainty measure (e.g., standard deviation, confidence interval, etc.) to all the performance metrics. (If the cost of retraining all the models multiple times is prohibitive, even simple test set resampling / bootstrapping can already provide some measure of uncertainty.) A minimal bootstrap sketch is given after this list.
    • Simple group balancing is another important baseline that should be added. See, e.g., Zhang et al., Improving the fairness of chest X-ray classifiers, and Idrissi et al., Simple data balancing achieves competitive worst-group-accuracy.
    • Especially as the paper is concerned with fairness, it would also be interesting to assess performance on a more diverse dataset, such as the Diverse Dermatology Images (DDI) dataset (see Daneshjou et al., Disparities in dermatology AI performance on a diverse, curated clinical image set). (I am well aware that this would require significant extra effort, of course, and I am not expecting it for the rebuttal period.)
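    Regarding the uncertainty-measure bullet above: a minimal percentile-bootstrap sketch in Python follows. The array names (y_true, y_pred, groups), the example metric, and the NumPy-only setup are illustrative assumptions, not the authors' evaluation code.

import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, groups, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a (fairness) metric,
    obtained by resampling the test set with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample indices with replacement
        stats.append(metric_fn(y_true[idx], y_pred[idx], groups[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return np.mean(stats), (lo, hi)

# Example metric: absolute true-positive-rate gap between two groups.
def tpr_gap(y_true, y_pred, groups):
    tprs = []
    for g in (0, 1):
        mask = (groups == g) & (y_true == 1)
        tprs.append(y_pred[mask].mean() if mask.any() else 0.0)
    return abs(tprs[0] - tprs[1])

    The resulting mean and (lo, hi) interval could then be reported next to each point estimate in the tables.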
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The study uses publicly available datasets, and the authors have indicated that all code required to run the experiments will be made available after publication.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    In addition to my main concerns outlined above, I have the following, minor remarks & suggestions.

    1. The abstract and introduction are hard to follow if one does not already know what the multi-exit framework is / how it works. I would suggest adding 1-2 clarifying sentences in both places.

    2. Similarly, the motivation section relies heavily on the SNNL but never clearly defines it.

    3. In a number of places, the manuscript lacks precision concerning terms such as “bias” and “fairness”. Some examples:
      • The authors write that “algorithmic bias has been found in dermatological disease datasets” - but datasets, by definition, cannot be the subject of algorithmic bias. When speaking about biases, it is crucial to be very precise in defining what exactly was found to be “biased”. How were the datasets biased?
      • The authors write that “bias can arise when there is an imbalance in the number of images representing different skin tones, which can lead to inaccurate predictions and misdiagnosis due to biases towards certain skin tones.” While this is true (group imbalance can lead to the suggested outcomes), it is neither true that imbalance necessarily leads to biases, nor that biases are necessarily the result of group imbalances.
      • The authors write that “pre-processing and post-processing methods have limitations that are not applicable to dermatological disease diagnostic tasks since they need extra sensitive information during the training time.” Firstly, why would methods that need access to protected attributes during training - such as the method the authors propose in this very manuscript! - not be applicable to dermatological disease diagnostic tasks? And, secondly, there are methods that do not rely on explicit group membership information. See, e.g., i) Hébert-Johnson et al., Multicalibration: Calibration for the (Computationally-Identifiable) Masses, ii) Martinez et al., Blind Pareto Fairness and Subgroup Robustness, and iii) Zhao et al., Towards Fair Classifiers Without Sensitive Attributes.
      • In various places, the authors write about “deteriorating fairness”. Again, I would suggest being precise, and replacing “fairness” by the specific metric that is increasing/decreasing. “Fairness” will mean very different things to different readers.
    4. Given that the authors suggest that group imbalance might be the root cause of the observed disparities, what is the relative representation of the protected groups in the two considered datasets? Does underperformance correlate with group representation in these cases?

    5. Which fairness regularization do the authors use in their experiments? They write about “a fairness regularization loss, ls, such as the Maximum Mean Discrepancy (MMD) [9] or the Hilbert-Schmidt Independence Criterion (HSIC)”, but as far as I can see, they never specify which exact loss is used in the end. This is certainly important for the interpretation of the results.
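    Since item 5 notes that the exact regularizer is not specified, here is a minimal sketch of one of the named options, an RBF-kernel MMD between the features of two sensitive groups; it illustrates the general technique under stated assumptions, not the paper's implementation.

import torch

def rbf_mmd2(feat_a, feat_b, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two feature batches
    (shapes n_a x d and n_b x d) using a Gaussian RBF kernel."""
    def rbf(x, y):
        d2 = torch.cdist(x, y) ** 2               # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    k_aa = rbf(feat_a, feat_a).mean()
    k_bb = rbf(feat_b, feat_b).mean()
    k_ab = rbf(feat_a, feat_b).mean()
    return k_aa + k_bb - 2 * k_ab                 # biased MMD^2 estimate

    Minimizing this quantity pushes the two groups' feature distributions together; an HSIC-based regularizer plays an analogous role with a different independence measure.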
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed approach is interesting and promising, the paper is reasonably well-written, and the evaluation is acceptable. If the authors can alleviate at least some of my concerns regarding the evaluation - adding at least some of the suggested simple baseline methods, additional metrics, and/or uncertainty quantification - I believe this will be a very good and interesting contribution.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Trying to achieve fairness in image classifiers by using a multi-exit/early-exit paradigm to avoid decisions based on the potentially biased image representations in deeper layers of a CNN whenever possible is an interesting idea. The authors motivate this approach quite well with support of literature on “overthinking” in CNNs.

    All reviewers agree that the use of a multi-exit neural network architecture for bias reduction/fairness improvement is novel. It was also highlighted that the technique is in principle architecture-independent and does not require access to the protected attributes during inference. The evaluation on ISIC2019/Fitzpatrick-17k involves several relevant baseline approaches, and the proposed multi-exit method outperforms them with respect to fairness measures while also achieving high classification accuracies.

    The major weaknesses identified by the reviewers are related to missing details of the method (e.g., missing loss function, definition of SNNL), concerns about the setup of the experiments (e.g., choice of gender/dark skin as attributes tested, only small architectures are considered, no bootstrapping/measurement of the uncertainty, …), and lack of support of the claim that the proposed method outperforms the baselines significantly as no statistical tests were done.

    The majority of suggestions provided by the reviewers on improving the paper would lead to new experiments, which cannot be added at this stage of the MICCAI reviewing/submission process. I would, therefore, ask the authors to focus on clarifying the aspects mentioned above, with a specific emphasis on justifying their experimental setup and on why/how the results show the claimed advantages of the proposed method. This should especially include a justification for the choice of gender/dark skin as the protected attributes (see concerns raised by R1).




Author Feedback

We sincerely appreciate the valuable suggestions from the reviewers and are committed to addressing the major concerns and misunderstandings raised.

#R1 & #R3

Q1. Clarification of the SNNL.
A1. Wang et al. showed that if a model cannot differentiate the sensitive attributes of the representations, the model will generate fairer results. Motivated by reference [5] in the paper, which proposed the concept of SNNL and showed that it can be used to measure the similarity between representations of different classes, we extended the concept to measure the similarity between representations of different sensitive groups. As such, a high SNNL suggests that the representations from different sensitive groups are similar and, therefore, that the model cannot differentiate them. (Wang, Zhibo, et al. “Fairness-aware adversarial perturbation towards bias mitigation for deployed deep models.” 2022)

Q2. Loss function used for training.
A2. In Section 3.2, we stated that our loss is obtained through a weighted sum of each CLF’s loss, where l_t and l_s represent the cross-entropy loss and the fairness regularization loss of the respective method, respectively. For example, in ME-MFD, l_s would be L_MFD (Eq. 6 in [9]).

#R1

Q3. The choice of protected (sensitive) attribute.
A3. Chen et al. and several other works have shown that even with sensitive attributes that are not observable in the input, models can still exhibit bias in their predictions due to that very attribute. We therefore chose two datasets, one with a sensitive attribute clearly observable (skin color in Fitzpatrick-17k) and one not observable (gender in ISIC2019). Our experiments show that various models are in fact biased w.r.t. gender on ISIC2019. (Chen, Jiahao, et al. “Fairness under unawareness: Assessing disparity when protected class is unobserved.” 2019)

Q4. Details of the data split?
A4. We did use the stratified split approach for the training/test split, as suggested by the reviewer.

Q5. The use of the term “privileged group”?
A5. Following most fairness works (Du et al.), we name the groups with advantages or desired model outcomes, typically with higher accuracy, as privileged groups. (Du, Mengnan, et al. “Fairness via representation neutralization.” 2021)

Q6. Performance comparison between ME-ResNet18 and FairPrune?
A6. We will modify the language to “comparable performance” to be more precise. On the other hand, this comparison is not important; our method can be applied to any backbone, and therefore the most important results are the apples-to-apples comparisons shown in Table 3, which show that ME-FairPrune performs better than FairPrune.

Q7. No statistical tests to support “significantly”.
A7. We will delete “significantly” to be more rigorous.

Q8. The reproducibility of this paper is low.
A8. Our method is evaluated using publicly available datasets with the settings of existing work [21]. In addition, we will release our code upon acceptance.

#R2

Q9. The impact of the ME framework on F1.
A9. Out of all the results, the ME framework only leads to an increased difference in F1 for the MFD backbone on the ISIC2019 dataset. As the fairness metrics commonly used in the literature and adopted in this paper (Eodd, Eopp) are based on the true positive/negative rate or false positive rate, and do not directly consider F1, this can happen.

Q10. Add a more complex backbone.
A10. Larger models beyond VGG19 easily overfit on both datasets, so we were not able to use them.

Q11. Comparison of the time required to train and infer the model w/ and w/o the ME framework.
A11. As our method uses the results from early exits instead of the final output, it leads to a speedup in inference (similar to the observations in [11, 18]). As for training, only a few additional internal classifiers, each with 2 layers, are added, so the overhead is not significant.
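To make A2 concrete, a minimal sketch of the weighted-sum training objective described there is shown below; the exit weights, the lambda trade-off, and the generic fairness_reg callable are placeholder assumptions rather than the released code (in ME-MFD, for instance, l_s would correspond to the MFD loss, Eq. 6 in [9]).

import torch
import torch.nn.functional as F

def multi_exit_loss(exit_logits, targets, exit_feats, groups,
                    fairness_reg, weights, lam=1.0):
    """Weighted sum over exits of a classification term (l_t) plus a
    fairness regularization term (l_s), as described in the rebuttal."""
    total = 0.0
    for logits, feats, w in zip(exit_logits, exit_feats, weights):
        l_t = F.cross_entropy(logits, targets)    # conventional classification loss
        l_s = fairness_reg(feats, groups)         # e.g. an MMD / HSIC / MFD-style term
        total = total + w * (l_t + lam * l_s)
    return total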




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    While the authors did not convincingly address all of the concerns raised by the initial reviews regarding their experimental setup, all reviewers now see the paper above the bar and I, therefore, recommend its acceptance.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper at hand presents a multi-exit training framework aimed at enhancing fairness in relation to selected sensitive attributes. The multi-exit framework uses features from the neural network’s early layers, which are less discriminative but less biased towards sensitive attributes. The framework was tested on two dermatology disease diagnosis datasets, ISIC 2019 (gender as the sensitive attribute) and Fitzpatrick-17k (skin tone as the sensitive attribute), using class equalized opportunity and equalized odds as fairness metrics. The results demonstrate that using the multi-exit framework alongside fairness methods like HSIC and MFD resulted in better fairness outcomes than when used without the framework. The combination of the multi-exit framework and FairPrune yielded the best results.

    The paper shines by addressing a crucial issue in automated skin lesion diagnosis, namely, fairness towards sensitive attributes like gender and skin tone. The proposed method, supported by existing literature on network overthinking and early exiting, offers an intriguing approach. A detailed comparison with multiple baselines, including FairPrune, shows that their method enhances the fairness of diagnosis while also improving diagnostic performance itself.

    On the other hand, the paper lacks clarity in some areas. While the authors explained their usage of stratified splits for data partitioning, the specific details about these splits were not provided. It would be beneficial to see the framework applied in a multi-class sensitive attribute setting and how it performs with multiple sensitive attributes simultaneously, mimicking the real-world clinical scenario. A more detailed discussion about the impact of the multi-exit framework on classification performance is warranted, as the results appear to be highly variable regarding its effect on the F1 score. Further, the evaluation of the framework with more complex neural network architectures would have added value. Regarding the quantitative evaluation, the paper would have benefited from the addition of a threshold-independent evaluation metric and a measure of uncertainty for performance metrics. It would have been beneficial to see the addition of a simple group balancing baseline. Also, examining the performance on a more diverse dataset, like the Diverse Dermatology Images (DDI) dataset, would have provided insights on fairness.

    In their rebuttal, the authors have addressed these concerns to a large extent. They clarified the use of SNNL and the loss function for training. They have explained their choice of sensitive attributes and confirmed the use of stratified split for data partitioning. Furthermore, they have provided their rationale behind using the term “privileged group”, and they have agreed to make their language more precise regarding the performance comparison between ME-ResNet18 and FairPrune. Importantly, they also committed to releasing their code upon acceptance of the paper, addressing concerns regarding reproducibility. In light of the authors’ responses and the significant contributions of the paper, I recommend the acceptance of the paper. The authors have demonstrated a solid understanding of the subject and have provided satisfactory responses to the issues raised.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Pros:

    • proposes a method to segment and reconstruct retinal vasculature in 3D OCTA images, providing richer spatial distribution information than 2D segmentation.
    • simplicity and usefulness for application in ophthalmology.

    Cons:

    • innovation, methodological details and experimental validation.
    • substantial methodological flaws.

    After Rebuttal:

    • + reviews are more consistent and positive
    • + major issues are well explained
    • − no strong support is received


