
Authors

Xiaodan Xing, Federico Felder, Yang Nan, Giorgos Papanastasiou, Simon Walsh, Guang Yang

Abstract

Synthetic images generated by deep generative models seem to be a silver bullet for data scarcity and data privacy issues. The selection of synthesis models is mostly based on image quality measurements, and most researchers favor synthesis models that produce realistic images, i.e., images with good fidelity scores, such as a low Fréchet Inception Distance (FID) and a high Peak Signal-to-Noise Ratio (PSNR). However, the quality of synthetic images is not limited to fidelity, and a wide spectrum of metrics should be evaluated to comprehensively measure the quality of synthetic images. In addition, quality metrics are not reliable predictors of the utility of synthetic images, and the relations between these evaluation metrics are not yet clear. In this work, we have established a comprehensive set of evaluators for synthetic images, covering fidelity, variety, privacy, and utility. By analyzing more than 100k chest X-ray images and their synthetic copies, we have demonstrated that there is an inevitable trade-off between synthetic image fidelity, variety, and privacy. In addition, we have empirically demonstrated that high utility does not require images with both high fidelity and high variety. For intra- and cross-task data augmentation, mode-collapsed images and low-fidelity images can still demonstrate high utility. Finally, our experiments have also shown that it is possible to produce images with both high utility and privacy, which provides a strong rationale for the use of deep generative models in privacy-preserving applications. Our study can provide comprehensive guidance for the evaluation of synthetic images and motivate further development of utility-aware deep generative models in medical image synthesis.
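
(For reference, a minimal numpy/scipy sketch of the two fidelity metrics named above; the function names and the assumption of precomputed Inception-v3 features are illustrative choices, not the authors' implementation.)

import numpy as np
from scipy import linalg

def psnr(real, synth, data_range=255.0):
    """Peak Signal-to-Noise Ratio between two same-shape images; higher is better."""
    mse = np.mean((real.astype(np.float64) - synth.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def fid(feats_real, feats_synth):
    """Frechet Inception Distance between two (n_samples, n_features) arrays of
    image features (e.g., Inception-v3 pool features); lower is better."""
    mu1, mu2 = feats_real.mean(axis=0), feats_synth.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_synth, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny spurious imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))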

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_2

SharedIt: https://rdcu.be/dnwGE

Link to the code repository

https://github.com/ayanglab/MedSynAnalyzer

Link to the dataset(s)

N/A


Reviews

Review #4

  • Please describe the contribution of the paper

    The authors aim to disentangle the different metrics used to evaluate the quality of synthetically generated images in medical imaging, from the perspectives of fidelity, variety, privacy, and utility in downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article proposes a four-dimensional evaluation metric for synthetic images, including a novel privacy evaluation score and a utility evaluation score. The authors perform extensive experiments on over 100k chest X-ray images and show that the different metrics might point to different applications and cannot be used in a one-size-fits-all manner.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major drawbacks. The experiments seem to be well organised and conducted rigorously.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will release the code upon acceptance. The paper seems to adhere to the guidelines of the conference.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Overall, I would like to congratulate the authors on their good work. The article is interesting and offers a new perspective on evaluation practices for synthetic data, which can have a relevant impact on the field. Disentangling the different use cases and showing that a visually appealing generated image might not always be “highly useful” in the context under consideration could help bring synthetic data into more extensive use.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Clear organisation, well-conducted experiments, and interesting results that can have an important impact on the field.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper established a comprehensive set of evaluators for synthetic medical images, including fidelity, variety, privacy, and utility. The authors used 100k chest X-ray images and two state-of-the-art deep generative models to draw three conclusions: (1) they demonstrated the negative correlations among synthetic image fidelity, variety, and privacy; there is an inevitable trade-off among different aspects of synthetic images, especially between fidelity and variety; (2) they discovered that the common problems in data synthesis, i.e., mode collapse and low fidelity, can sometimes be a merit depending on the motivations of different downstream tasks; different downstream tasks require different properties of synthetic images, and synthetic images do not necessarily have to reach high metric scores across all aspects to be useful; (3) they showed that it is possible to achieve both privacy and utility in transfer learning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper has a good motivation, and the work is important for practical applications.
    2. The method design is sound.
    3. The writing is great.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It would be much better if they could’ve used more than two generative models to further evaluate their findings.
    2. It would be much better if they could’ve asked more human experts to evaluate the images.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not entirely sure about the reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. It would be much better if they could’ve used more than two generative models to further evaluate their findings.
    2. It would be much better if they could’ve asked more human experts to evaluate the images.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation and the importance of this work should be recognized.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The study introduces a new framework to assess synthetic medical imaging data using four parameters: fidelity, utility, variety, and privacy. The core novelty of the introduced framework lies in its ability to assess each of these parameters disentangled from the others. The study describes experiments investigating how different rating scores along these dimensions, for synthetic data generated using two different GAN methods, affect the performance of downstream applications using such synthetic data. The study finds that poor ratings on some of the assessment dimensions - most importantly on fidelity - do not always negatively affect the performance of downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the study lies in introducing an improved understanding of the real-world applicability of assessment scores for synthetic medical imaging data for use in clinical workflows. In particular, the finding that low fidelity scores of synthetic imagery do not necessarily hamper the performance of classification tasks is important and could be used to improve clinical decision support systems.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the paper lies in the relatively moderate degree of methodological novelty. The introduced assessment framework - while certainly applicable - combines known methods of synthetic data characterization. Furthermore, the proposition that there are cases in medical practice where ‘privacy is not an issue’ indicates a problematic misunderstanding of the use of medical data in clinical decision support systems. The FDA SaMD guidelines have data privacy as an integral part. Even for pure research purposes, ethical approval to use (synthetic) data will only be given when basic data privacy rules are implemented. There are use cases where privacy is less of an issue than in others, but it should never be stated that ‘privacy is not an issue’.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided information that allows a very high degree of reproducibility (including GitHub repositories and data sources).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I recommend investigating the applicability of the developed framework for non-clinical/non-medical use cases. Privacy is always a key concern for data management in health and medicine, so the feasibility study comparing use cases where ‘privacy is an issue’ vs. ‘privacy is not an issue’ is of limited relevance in a clinical environment (see FDA SaMD regulation, FDA ML best practices, and institutional ethics approvals for research). Sectors that might benefit more from the developed framework could, for example, include the creative arts and geospatial surveying.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Key cons: lack of methodological novelty, limited clinical applicability.

    Key pros: an interesting, coherent framework for assessing synthetic imagery data that might have broader applicability in non-clinical settings.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    This manuscript proposes evaluation metrics to assess the fidelity, variety, privacy, and utility of synthetic images. Experiments are performed on a large set of chest X-rays, identifying the presence of pleural effusion.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -S1) I agree with the global vision: in the evaluation of synthetic data, multiple dimensions should be considered.

    -S2) The metrics for fidelity and privacy are quite well explained and motivated.

    -S3) This paper may trigger interesting discussions at the conference.

    -S4) The comparison of fidelity and variety with visual scoring (Fig 2c,d, section 4.1) is very valuable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -W1) The definitions of the evaluation metrics for variety and utility seem a bit ad hoc.

    -W1a) For variety, I could imagine several other definitions, one of them already proposed in the paper (“It is worth noting that variety can also be quantified by the standard deviation of the discrete latent features”). The blurriness of the average image is easy to cheat: simply translating a single image by random offsets would result in a very blurry average (see the sketch below).

    -W1b) For utility, the tested metric seems reasonable, but quite specific to the application of chest X-ray classification (though perhaps this is unavoidable for a utility metric).
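
    (A self-contained toy illustration of the “cheat” in W1a, in numpy; the test image and the sharpness proxy are our own choices, not the paper's variety metric.)

    import numpy as np

    rng = np.random.default_rng(0)

    # One fixed "image": a bright square on a dark background.
    img = np.zeros((64, 64))
    img[24:40, 24:40] = 1.0

    # A "synthetic set" that is just this one image translated by random offsets.
    shifts = [tuple(rng.integers(-10, 11, size=2)) for _ in range(200)]
    avg = np.mean([np.roll(img, s, axis=(0, 1)) for s in shifts], axis=0)

    def edge_energy(x):
        # Mean squared gradient magnitude; low values indicate a blurry image.
        gy, gx = np.gradient(x)
        return float(np.mean(gx ** 2 + gy ** 2))

    # The average is far blurrier than the single source image, so a variety
    # score based on average-image blurriness is fooled despite zero variety.
    print(edge_energy(img), edge_energy(avg))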

    -W2) Some of the conclusions seem biased and are not completely justified by the results.

    -W2a) “The fidelity and variety score calculated with our method matched perfectly with human perception (p<0.05)” -> looking at Fig 2c-d, I would not say “perfectly”. The correlations seem to be driven mainly by the extreme points.

    -W2b) Sec 4.2, Fig 3a-b: the strong negative correlation is entirely driven by the addition of LDM, i.e., by one data point (see the sketch below).

    -W2c) “First, the intra-task augmentation utility favors synthetic data with higher fidelity” -> this was shown for one application, so I think the conclusion is phrased too strongly.

    -W2d) “Fig. 4 (d) shows an interesting positive correlation between privacy and cross-task utility” -> I assume you mean Fig. 3d. The correlation shown there is not significant.
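
    (A toy numpy/scipy sketch of the statistical pitfall in W2b; all numbers are synthetic and purely illustrative.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Scores for several generator settings with no real relationship...
    x = rng.normal(0.5, 0.05, size=20)  # e.g., fidelity scores
    y = rng.normal(0.5, 0.05, size=20)  # e.g., privacy scores
    print(stats.pearsonr(x, y))         # typically weak and non-significant

    # ...plus one extreme point (in the paper, the LDM results).
    x_all = np.append(x, 0.9)
    y_all = np.append(y, 0.1)
    # Now a strong "negative correlation" appears, driven entirely by the
    # single added point.
    print(stats.pearsonr(x_all, y_all))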

    -W3) “The result is shown in Fig. 2 (c). The fidelity and variety score calculated with our method matched perfectly with human perception (p < 0.05), and FID, which was highly influenced by mode collapse and increased over the diversity, failed to provide a valid analysis of image fidelity” -> I don’t see any results for FID in Figure 2c.

    -W4) Since the purpose of this paper is to propose the metrics, I would expect some more detailed investigation of the impact of seemingly arbitrary choices, like the use of VQ-VAE vs a normal VAE, the settings of Q and k, and the choice of variety measure (see W1a).

    -W5) The title is catchy, but not very informative or accurate.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be released (but cannot be checked now due to anonymisation), and public data was used. Section 3 and the supplementary material give quite a lot of detail. I would rate the reproducibility as very good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    -D1) I don’t understand eq 5. I would expect it the other way around: when the synthetic image is part of a real set, the fidelity should be high (1) instead of low (0).

    -D2) I don’t fully understand section 4.1 and figure 2c. What does each data point represent? A set of different synthetic images (generated by a different method)?

    -D3) “However, we did observe a similar pattern of utility in our experiments shown in Fig. 3 (a-b).” -> I don’t understand what kind of pattern I should see there.

    -D4) In Fig 3c-d, the different settings of \phi for the StyleGAN show quite random privacy values, not correlated with \phi. This is a bit unexpected, since Figure 1 in the supplement suggests that the effect of \phi is monotonic in terms of diversity and sharpness. Any idea why it does not seem to correlate with privacy?
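
    (For context on D4: \phi is the truncation parameter. A minimal sketch of the standard StyleGAN truncation trick (not the paper's code) shows why its effect on diversity is monotonic, which is what makes the flat privacy behaviour surprising.)

    import numpy as np

    def truncate_w(w, w_avg, phi):
        """StyleGAN-style truncation: interpolate latents toward the average
        latent. phi=0 collapses every sample to w_avg (sharp, zero variety);
        phi=1 leaves the latent distribution untouched (full variety)."""
        return w_avg + phi * (w - w_avg)

    rng = np.random.default_rng(0)
    w_avg = np.zeros(512)          # average intermediate latent (illustrative)
    w = rng.normal(size=(4, 512))  # a batch of sampled latents
    for phi in (0.2, 0.5, 1.0):
        spread = truncate_w(w, w_avg, phi).std()
        print(f"phi={phi}: latent spread ~ {spread:.2f}")  # grows with phi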

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    S3 is the strongest argument for acceptance. The paper is very comprehensive. Some parts are more convincing (e.g., S2, S4) than others (e.g. W1, W2). All in all, I think this paper could be interesting to discuss at the conference.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Metrics for evaluating the utility of synthetic images are much needed in the current state of medical imaging, and thus there is strong overall agreement among reviewers that this paper would be of great interest in spurring discussion at MICCAI. Overall, the paper is organized and written very well, with clear and impactful motivations. Some reviewers have noted weaknesses in the paper that should be addressed prior to publication, hence I strongly suggest that the authors address these in the final version. Namely, the authors should address R3’s concerns on the use case of synthetic images when privacy may be an issue.




Author Feedback

N/A


