
Authors

Ruolin Su, Xiao Liu, Sotirios A. Tsaftaris

Abstract

Rights provisioned within data protection regulations permit patients to request that knowledge about their information be eliminated by data holders. With the advent of AI learned on data, one can imagine that such rights can extend to requests for forgetting knowledge of a patient’s data within AI models. However, forgetting patients’ imaging data from AI models is still an under-explored problem. In this paper, we study the influence of patient data on model performance and formulate two hypotheses for a patient’s data: either they are common and similar to other patients’ data, or they form edge cases, i.e. unique and rare cases. This shows that it is not possible to easily forget patient data. We propose a targeted forgetting approach to perform patient-wise forgetting. Extensive experiments on the benchmark ACDC dataset showcase the improved performance of the proposed targeted forgetting approach as opposed to a state-of-the-art method.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_60

SharedIt: https://rdcu.be/cVVqh

Link to the code repository

N/A

Link to the dataset(s)

https://www.cs.toronto.edu/~kriz/cifar.html

https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper discusses the problem of machine unlearning of patient-wise data from ML models. A targeted forgetting approach is presented and evaluated on cardiac MRI data (and compared to a computer vision application).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is an interesting topic and challenging problem from a machine learning perspective. The paper provides good arguments and a sensible approach for patient-wise forgetting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experimentation is limited to a single medical application, which suffers from the problem of limited data. While this might be a particularly challenging example, as fewer than a hundred patients are used for model development, the more interesting real-world applications are ML models trained on large-scale data (e.g., chest X-ray disease detection). The paper would have been stronger had such an application been explored.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Likely to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Part of the problem of the presented application is the limited data, which makes almost every case an edge case. It is unlikely that clusters emerge in high-dimensional imaging data when only a hundred samples are available. Neural networks will likely overfit to such data (they may still interpolate well in between), but that means that the hypothesis of separating edge cases from cluster cases may not be valid. A more interesting application to test this on would have been image classification trained on 100k+ images (e.g., chest X-ray disease detection).

    I am not an expert on privacy and newer regulations such as the GDPR. However, the described use case of an individual whose data would need to be removed from a trained ML model seems unlikely to be a legal requirement. The so-called ‘right to be forgotten’ does not seem to apply here (see https://gdpr.eu/right-to-be-forgotten/). I would think that a trained ML model falls under the exemption stating “The data represents important information that serves the public interest, scientific research, historical research, or statistical purposes and where erasure of the data would likely to impair or halt progress towards the achievement that was the goal of the processing.” With this in mind, while the paper is thought-provoking and stimulating, I am unsure about its practical relevance. In particular, ML models for production are typically developed on (fully) anonymised data, where data privacy regulations such as the GDPR do not apply.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I may argue for accepting the paper because it is thought-provoking and may stimulate interesting discussions.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Convincing rebuttal. Adjusted my score accordingly.



Review #2

  • Please describe the contribution of the paper

    This paper addresses the problem of forgetting patient data in a DL model when, for example, patient consent is withdrawn. The problem is phrased as patient-wise forgetting, i.e. one patient (all images of that patient) is selected to be forgotten. The authors formulate two hypotheses: the patient’s data is similar to other data (common cluster hypothesis) or the patient’s data is different from other data (rare case, edge case). They show that the common cluster hypothesis often holds for computer vision data, while the edge case is more common in medical imaging data. They propose a new approach for forgetting edge-case patient data. The hypotheses and method are evaluated on CIFAR-10 and the ACDC dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper addresses an important aspect of AI in healthcare applications, and is of great interest to the MICCAI community.
    • The paper is well-written, well-structured and easy to understand.
    • The distinction between common cluster data and edge case data seems to be important and the comparison between computer vision and medical datasets is very interesting and relevant.
    • In general, the results and conclusions in this paper are convincing and relevant.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I would have liked to see a more in-depth discussion on the implications of the paper’s findings. Is forgetting patient data a realistic way to deal with withdrawn patient consent? What is more important: respecting data protection or ensuring model performance?
    • I’m wondering about the correct definition of when a dataset is correctly forgotten. Table 1 suggests that data is forgotten if the accuracy is 0.0. But isn’t an accuracy of 0.5 (random classification) better suited?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper evaluates the method on publicly available data. Information on the dataset split and training process is provided. The method is sufficiently explained.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This is a well-written paper describing and addressing a very important aspect for AI in healthcare applications. The paper is easy to follow and the findings are relevant to the MICCAI community.

    I would have liked to see a more in-depth discussion on the implications of the paper’s findings. Is forgetting patient data a realistic way to deal with withdrawn patient consent? What is more important: respecting data protection or ensuring model performance? What is the connection to differential privacy, and can we learn something from it?

    If I interpret Table 1 correctly, a patient’s data is forgotten if the accuracy on this data is 0.0. I would assume that the model does not have any knowledge about this data if the classification accuracy is 0.5 (random). An explanation/definition is missing here.
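
    To make the chance-level criterion concrete, here is a minimal, hypothetical sketch (not from the paper; `predict`, `forget_x`, and `forget_y` are stand-in names for the model and the forget set) of testing whether forget-set accuracy sits near 1/K rather than at 0.0:

```python
import numpy as np

def chance_level_check(predict, forget_x, forget_y, num_classes, tol=0.05):
    """Compare forget-set accuracy to the chance level 1/K.

    Accuracy near 1/K suggests the model retains no usable knowledge
    of the patient; accuracy of 0.0 would mean the model is still
    informative about the data (it is reliably wrong).
    """
    preds = predict(forget_x)                # predicted class indices
    acc = float(np.mean(preds == forget_y))  # accuracy on the forget set
    chance = 1.0 / num_classes               # 0.5 binary, 0.2 for five classes
    return abs(acc - chance) <= tol, acc, chance
```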

    A minor comment is that the classification error is defined too late (only in the caption of Table 1). It should be introduced earlier in the text.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an important aspect of AI in healthcare and is of great interest to the MICCAI community. Although I see some room for improvement, I recommend acceptance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    I didn’t change my opinion of the paper after reading the author’s rebuttal and the other reviews. The paper has some limitations (the authors promised some changes, but this is hard to verify), but I still think that this is an interesting work to be presented at MICCAI.



Review #3

  • Please describe the contribution of the paper

    In this paper the authors propose a framework that can be used to forget a patient’s imaging data from AI models. The proposed approach divides patient data into two categories: edge cases and common cases. The authors claim that the proposed framework outperforms related work when edge cases need to be removed/forgotten.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Intuitive approach to the patient-forgetting problem. Despite the empirical-validation issues highlighted later, the proposed idea (separation of patients into common and edge cases) is intuitive and has the potential to be a useful contribution to the field once properly validated.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main difference of this paper compared to [9] is that the patient data is divided into two categories: edge cases and common cases. It is unclear how prevalent this division is for medical imaging data. The authors justify it with experiments on a single dataset, and even on this one dataset no confidence bars are computed. With this level of empirical validation it is hard to assess the practical impact of the proposed method.
    • In the medical imaging dataset that was used, there is a large discrepancy between the test-set performance (reported as 0.19 error) and the results reported in Fig. 3, which show that “By considering a threshold of 50% on the error of the golden model, we find that > 60% of patients in ACDC can be considered to belong to the edge case hypothesis.” The big difference between these numbers and the test-set performance indicates that the model training may not have been done properly (possible overfitting issues). This raises additional concerns about the generalizability of the empirical results.
    • As a general remark the authors should include error bars in all their reported results, following the standards of paper [9].
    • For the medical imaging datasets, the authors should use an NN architecture that has close to state-of-the-art performance. It is unclear whether this is the case with the architecture that was employed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Sufficient information to reproduce the results is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This paper presents a framework for the patient-forgetting problem that is an intuitive extension of what has been proposed in the literature; however, the empirical validation is insufficient to illustrate its practical usefulness.

    Detailed recommendations for improvement are provided in the “paper weaknesses” part of the review.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Weak empirical validation. Detailed recommendations for improvement are provided in the “paper weaknesses” part of the review.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This is an interesting paper assessing the ability of deep learning models to forget patient data e.g. if consent is withdrawn. The paper builds on the computer vision investigation of reference [9], but makes a comparison between computer vision and medical imaging datasets and finds that these have different properties when it comes to forgetting data points.

    The reviewers appreciate the goal of the paper, and two of the reviewers find it very interesting. All reviewers express concern with the size of the medical imaging dataset, and questions are raised as to whether this is the real cause of the observed differences.

    In their rebuttal, the authors should address:

    • The dataset size and its effect on the empirical results
    • The relevance of the problem given the concerns of Reviewer 1
    • The experimental concerns raised by Reviewer 3
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank everyone for the valuable and helpful comments. We appreciate that R1 and R2 agree on the importance of the problem we raise and address, and that it would be of interest to the community. Below, we respond to the points raised, starting with those we were asked to address by the area chair.

  1. Practical relevance (R1): There is a practical need. For example, as mentioned in [8], “participants are free to withdraw from the UK Biobank at any time and request that their data no longer be used”. Several algorithms have been trained on these data from millions of participants and are now in the public domain. The case our method addresses is how we can continue to use these methods when retraining de novo is not possible. Indeed, the UK’s Information Commissioner’s Office (shorturl.at/kuyEV) states: “If the request is for rectification or erasure of the data, this would not be possible to achieve without having to re-train the model (either with the rectified data, or without the erased data), or deleting the model altogether.” Thus, broader interpretations of the right of erasure in the GDPR cannot be excluded in the future.

  2. Dataset size and its effect (R1): Recent theory on long-tail learning [Ref1] suggests that increasing dataset size does not reduce the chance of having edge cases. This makes sense: given the large dimensionality of medical imaging data (e.g., 10^6 pixels), a simple rule of thumb based on the curse of dimensionality would require at least on the order of 10^6 samples. We agree, though, that adding another dataset would help. BraTS has few volumes, so we might consider ADNI. If we manage to obtain data agreements in time, we will include these results. R1 is concerned that we don’t see any cluster case. Fig. 3 shows that there are 6 patients which, once removed, lead to zero test-set error. These 6 patients likely form a cluster.
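
  For concreteness, a minimal sketch of one plausible reading of the patient-wise labelling behind Fig. 3 (here `train_without` and `eval_error` are hypothetical stand-ins for the actual training and evaluation pipeline, which the rebuttal does not specify):

```python
def label_patients(train_without, eval_error, patient_ids, thresh=0.5):
    """Label each patient as an edge case or a cluster case by
    retraining a 'golden' model without that patient's data and
    measuring its error on the held-out patient (0.5 threshold as
    described for Fig. 3).

    train_without(pid) and eval_error(model, pid) are hypothetical
    stand-ins for the actual training/evaluation pipeline.
    """
    labels = {}
    for pid in patient_ids:
        golden = train_without(pid)    # retrain from scratch without pid
        err = eval_error(golden, pid)  # error on pid's own images
        labels[pid] = "edge" if err > thresh else "cluster"
    return labels
```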

  3. Experimental Concerns (R3)

    • Model architecture and training: We adopt a model previously used for ACDC pathology classification, presented at MICCAI 2021 [20]. This VGG-like model had the best performance over the ResNet alternatives we considered. Other ACDC approaches use mixtures of experts (e.g., random forests or ensembling) and combine information across slices and temporal phases with LSTMs. Machine unlearning has not yet advanced to these architectures. It is not trivial to assess how overfitting controls the chance of memorisation. Some degree of memorisation (or overfitting) is required for generalisation to be optimal [Ref1]. It is quite possible that a less overfitted model would have fewer edge cases. We will investigate this further by redoing Fig. 3 with models trained less.
  • Confidence bars: Experiments in Table 1 were performed with 3 random seeds. The updated results (with confidence bars) do not change the conclusions; a minimal aggregation sketch is given after this list.

  • Prevalence of the two hypotheses in imaging data: It is a well-accepted argument that data make different contributions to the model [4,14,18,Ref1]. We are the first to show that two categories of patient data exist even in medical imaging data, which is consistent with these past observations. More interesting, though, is the portion of patient data falling into the edge case. We are keen to explore other datasets to see if this holds across them (see point 2 above).
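
  As referenced in the confidence-bars point above, a minimal, hypothetical sketch of the seed aggregation (the `run` callable is a stand-in for one full train-and-evaluate cycle):

```python
import numpy as np

def mean_and_std(run, seeds=(0, 1, 2)):
    """Aggregate a scalar metric over repeated runs with different
    random seeds; mean +/- sample std gives the error bars.
    run(seed) is a hypothetical stand-in that trains and evaluates
    one model and returns e.g. the test error."""
    vals = np.array([run(s) for s in seeds])
    return vals.mean(), vals.std(ddof=1)  # sample std across seeds
```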

  4. More in-depth discussion (R2): We will happily expand on the trade-off and the relation to differential privacy. In brief, we do discuss the trade-off between model performance and respecting data protection, which motivates the need for our approach. It appears that the noise level can affect this trade-off. Similar to [9], our approach is a weaker form of differential privacy.

  5. When a model has definitely forgotten (R2): We agree that, without a gold standard, the threshold of a random decision (e.g., 0.5 in the binary case, 0.2 in our five-class case) would help. With 0.2, the conclusions in Sec. 3 still hold. We will highlight this.

[Ref1] V. Feldman. Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2020.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors delivered a convincing rebuttal addressing most of the reviewers’ concerns. The one negative reviewer did not update their review after the rebuttal, so I chose to go with the two positive reviewers and recommend acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper targets the interesting and timely topic of machine unlearning. The work is preliminary, and a more in-depth investigation on larger datasets and meaningful models would be required to corroborate the conclusions made here.

    Nevertheless, the contribution of the paper is valuable for the conference, and the rebuttal was convincing in addressing most of the reviewers’ comments.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Most reviewers agree that the paper is interesting and should be accepted. The application is debatable but the rebuttal makes some good points here.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9


