
Authors

Peter He, Céline Jacques, Jérôme Chambost, Jonas Malmsten, Koen Wouters, Thomas Fréour, Nikica Zaninovic, Cristina Hickman, Francisco Vasconcelos

Abstract

In recent years, the field of embryo imaging has seen an influx of work using machine learning. These works take advantage of large microscopy datasets collected by fertility clinics as routine practice through relatively standardised imaging setups. Nevertheless, systematic variations still exist between datasets and can harm the ability of machine learning models to perform well across different clinics. In this work, we present Super-Focus, a method for correcting systematic variations present in embryo focal stacks by artificially generating focal planes. We demonstrate that these artificially generated planes are realistic to human experts and that using Super-Focus as a pre-processing step improves the ability of a cell instance segmentation model to generalise across multiple clinics.

Link to paper

DOI: 10.1007/978-3-031-16434-7_70 (https://link.springer.com/chapter/10.1007/978-3-031-16434-7_70)

SharedIt: https://rdcu.be/cVRsC

Link to the code repository

https://github.com/PeterTheHe/Super-Focus

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript presents a method for standardization and super-resolution (in the slice direction) of human embryonic data. The authors present a method for simulating realistic-looking focal planes that can be used both for generating missing planes and for upsampling the data via super-resolution. The methodology is simple but efficient, and the validation is convincing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Strong and complete conference submission: varied data, clear methodology and presentation, convincing validation.
    • Simple but efficient methodology resulting in improved performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • Some parts of this manuscript could benefit from a more detailed explanation. For example, it is unclear how the decision about which focal planes are missing (and need to be generated) is made.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All the methods developed in this paper are clear and valid. Also, the majority of the implementation details are properly described, and the values of all parameters are reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It is unclear how the decision about which focal planes are missing (and need to be generated) is made. This requires some sort of data alignment. From what I could deduce, this step was performed manually. In either case (even if it was automated), this needs to be mentioned explicitly.
    2. It would be interesting to obtain more information about the acquisition hardware: same/different manufacturer, model, etc.
    3. Section 4.1: “… 4 (uniformly) randomly selected planes …” I find it somewhat difficult to interpret how selection can be random and uniform at the same time. Please clarify.
    4. From the description in Section 4.2, the reader can get the impression that the stacks were shown to the experts in this particular order: 50 real, followed by 50 simulated, and then 20 copies. Was this the case? Or were the stacks shuffled before being shown?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a good conference paper, with simple but efficient methodology, clear presentation and convincing validation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a generative-model approach for generating the missing focal planes of different domains in human embryo microscopy. Three different generators each receive two consecutive planes and generate a third (one generator for each case: up, down, and middle). An autoencoder extracts features and reconstructs the input, a discriminator supplies an adversarial loss, and finally the latent feature representations of the generated image and its ground truth are aligned with a self-supervised loss. Results are reported (i) qualitatively, (ii) on embryo grading, and (iii) on single-cell segmentation.
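
    For context, a minimal sketch of how such a combined objective could be assembled (PyTorch; the function, module, and weight names here are illustrative assumptions, not taken from the paper):

        import torch
        import torch.nn.functional as F

        def generator_step(G, D, encoder, below, above, target, w_adv=0.01, w_per=1.0):
            fake = G(torch.cat([below, above], dim=1))          # predict the missing plane
            l_rec = F.l1_loss(fake, target)                     # pixel-level reconstruction
            l_adv = -D(fake).mean()                             # adversarial term
            l_per = F.mse_loss(encoder(fake), encoder(target))  # latent feature alignment
            return l_rec + w_adv * l_adv + w_per * l_per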

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A relevant problem solved with an interesting combination of different methods.
    • Results studied from various angles (embryo grading, cell segmentation, expert qualitative assessment).
    • The impact of the training dataset is briefly discussed.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Is the Frobenius norm the best choice for L_per? Would a different loss, such as cosine similarity, make any difference? (See the sketch after this list.)
    • It was not studied whether there is any domain shift between the datasets from different centers due to different microscope settings. The paper only discusses the number of focal planes, assuming that different domains and centers produce visually identical images.
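
    To make the first point concrete, the two candidate losses could be compared along these lines (a sketch; the feature tensors stand in for whatever maps L_per actually compares):

        import torch
        import torch.nn.functional as F

        def frobenius_loss(feats_real, feats_fake):
            # mean squared elementwise distance == squared Frobenius norm / numel
            return F.mse_loss(feats_fake, feats_real)

        def cosine_loss(feats_real, feats_fake):
            # 1 - cosine similarity: invariant to feature magnitude
            a = feats_real.flatten(1)
            b = feats_fake.flatten(1)
            return (1 - F.cosine_similarity(a, b, dim=1)).mean()
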
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Since all datasets are private, reproducing the reported results is not possible. Pretrained models are also not released. One can only draw ideas from the text for similar datasets/problems.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • A UMAP of the features embedded by the autoencoder could probably help us better understand whether there is any domain shift; see the sketch below.
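
    For instance, a minimal sketch using the umap-learn package (the embeddings here are random placeholders; in practice they would come from the trained autoencoder):

        import numpy as np
        import umap
        import matplotlib.pyplot as plt

        features_A = np.random.rand(200, 64)   # autoencoder embeddings, clinic A (placeholder)
        features_B = np.random.rand(150, 64)   # autoencoder embeddings, clinic B (placeholder)

        X = np.vstack([features_A, features_B])
        labels = np.array([0] * len(features_A) + [1] * len(features_B))
        emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)
        plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="coolwarm")
        plt.show()  # well-separated clusters would indicate a domain shift
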
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors suggest a relatively simple method that addresses dataset standardisation across different domains.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a method for predicting missing slices in embryo imaging datasets. The model is trained in a self-supervised manner and evaluated on a large dataset, including some tests with four human raters. Multiple generator models allow predicting missing slices either below, above, or between two existing slices. While the methods don’t seem novel per se, the application to this problem is reasonable, and the results indicate benefits of using this additional super-resolution approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Nicely written and easily understandable paper that tries to solve an important problem.
    • Self-supervised, i.e., no manual annotations required.
    • Large training / test dataset.
    • Several interesting ablation studies, including the finding that FID apparently is more suitable for judging the realism of generated images.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There’s not much to criticize from my side but I had a few questions that may be worth answering in the paper:

    • You only use D_A for training your model. In Table 2, the model seems to perform consistently worse on datasets C-E. Why not include a few images from these clinics as well to increase the variability seen during training? As it is a self-supervised approach, it should be straightforward to include other training images as well.
    • In Table 2, it would also have been interesting to see how human raters assess the unaltered images without any additional slices. Do complete images also obtain a score of 5, as would be expected?
    • You mention several changes you made to the original U-Net architecture but don’t explain why those changes were made. The same applies to the training: you state that you train for 30 epochs but don’t mention any criterion for stopping the training. Please comment.
    • I did not fully understand how you know when to apply which of the generators. Are slices systematically missing, such that you could use the same generator for all data from a particular clinic? Or is there some other sophisticated way of identifying the missing slices?
    • I do understand that predicting a slice between two existing ones might be reasonable and doable (similar to interpolating between the slices). However, I have some doubts that the predictions in the border regions (above/below the acquired stack) are guaranteed to reflect reality. Is there any way to assess the validity of the slices in these border regions?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Seems reproducible, provided that the authors indeed release the code upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In addition to the suggestions mentioned earlier, some minor comments:

    • Table 4 appears before Table 3. Consider swapping the labels.
    • Fig. 3: mention that within one panel the pairs are real/generated, respectively. Otherwise, one could think that the two left pairs are real and the two right pairs are fake.
    • Please carefully go through the references again. Some of them are incomplete or lack page numbers and the like (e.g., 9, 12, 13, 32).
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solid conference contribution that solves an interesting problem; a nicely written paper with a reasonable selection of methods that are not new per se but are used in a clever way. Quite a few validations, including multiple expert assessments and relatively large datasets.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a method for generating missing slices/focal planes in human embryonic data. All reviewers agree on the novelty of the approach and that this paper is well written.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers for their constructive comments.

A: The imaging hardware provides metadata on the focal depths present in a stack, so we can easily identify missing planes that need generating whenever not all planes are present. In the case of stack misalignment, the mean offset was calculated manually over the stacks; planes can then be generated in the direction of the offset to correct for it. We used the generators G^up and G^down to generate edge planes, and G^mid in all other cases.
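
In pseudocode, this selection logic reads roughly as follows (a sketch under our own naming; the metadata layout and helper names are assumptions, not taken from the released code):

    def fill_missing_planes(stack, depths, expected_depths, G_up, G_down, G_mid):
        # stack: dict focal depth -> image; depths: depths present in the metadata
        missing = [d for d in expected_depths if d not in depths]
        # fill gaps nearest to the acquired planes first, so later planes can
        # be generated from previously generated ones
        for d in sorted(missing, key=lambda d: min(abs(d - x) for x in depths)):
            if d < min(depths):                   # extrapolate below the stack
                lo, hi = sorted(depths)[:2]
                stack[d] = G_down(stack[lo], stack[hi])
            elif d > max(depths):                 # extrapolate above the stack
                lo, hi = sorted(depths)[-2:]
                stack[d] = G_up(stack[lo], stack[hi])
            else:                                 # interpolate between flanking planes
                lo = max(x for x in depths if x < d)
                hi = min(x for x in depths if x > d)
                stack[d] = G_mid(stack[lo], stack[hi])
            depths.append(d)
        return stack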

The planes were randomly sampled from a discrete uniform distribution.
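
In code terms (a minimal illustration; the stack size is a placeholder):

    import random

    num_planes = 11                                   # stack size (illustrative)
    selected = random.sample(range(num_planes), k=4)  # 4 distinct indices, each equally likely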

Yes.

All datasets were captured on Embryoscope or Embryoscope+ incubators manufactured by VitroLife. These are widely used and offer a degree of standardisation out of the box.

We calculated the FID between un-preprocessed central-focal-plane images of 4-cell embryos from the two largest datasets, A (n=2275) and B (n=146), to obtain an estimate of this visual domain shift, and got a value of 61.7. Though this value should be taken with a pinch of salt due to the small sample sizes, it does suggest some visual domain shift. However, many other works have explored domain adaptation in the face of visual differences. Our work, in contrast, focuses on a much less explored source of domain shift, which we believe to be sufficient scope for a single paper.
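
For reference, FID is the Fréchet distance between Gaussians fitted (means \mu, covariances \Sigma) to the Inception features of the two image sets:

    \mathrm{FID}(A, B) = \lVert \mu_A - \mu_B \rVert_2^2 + \operatorname{Tr}\bigl(\Sigma_A + \Sigma_B - 2(\Sigma_A \Sigma_B)^{1/2}\bigr)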

The goal of L_per is to compare the feature maps of the ground-truth and generated images. The Frobenius norm lets L_per boil down to the mean squared elementwise distance between the feature maps. We also considered cosine similarity but decided against it, since it does not take the relative magnitudes of the features into account, which is critical for perceptual quality.
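
In symbols (our notation: \phi denotes the feature extractor and N the number of feature-map elements; the exact normalisation is an assumption):

    L_{\mathrm{per}} = \frac{1}{N}\lVert \phi(y) - \phi(\hat{y}) \rVert_F^2 = \frac{1}{N}\sum_i \bigl(\phi(y)_i - \phi(\hat{y})_i\bigr)^2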

We suspect the worse performance on datasets C-E may be due to having to generate more planes (in the case of dataset E), having to generate planes from other generated planes, or visual domain shift (as raised by R2). We did not include samples from datasets C-E in training, since the goal was to evaluate adaptation to new, never-before-seen domains.

We agree that this would be an interesting baseline experiment. We have not collected any data on this and unfortunately, given the time constraints of the rebuttal period, are unable to provide any insights into the matter.

We motivate each architectural change with respect to the original U-Net below.
• BatchNorm – we add batch normalisation to our network as it is a well-known method to improve training stability.
• Shorter downsampling pathway – the images in our network are smaller than those in the original U-Net paper, so we thought it sensible to remove a downsampling step (with the added benefit of reducing the network’s memory footprint).
• Single-channel input – our images are all greyscale.
The number of training epochs for the final models was determined from preliminary experiments (the FID on the validation set plateaued around then).
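
For illustration, a downsampling block reflecting these choices might look as follows (a PyTorch sketch, not the released implementation; channel sizes are placeholders):

    import torch.nn as nn

    class DownBlock(nn.Module):
        """Two convolutions with batch normalisation, followed by pooling."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),          # added for training stability
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            self.pool = nn.MaxPool2d(2)

        def forward(self, x):
            skip = self.block(x)                 # features kept for the skip connection
            return self.pool(skip), skip

    # single-channel (greyscale) input and one fewer downsampling step than the
    # original U-Net, e.g.: DownBlock(1, 64), DownBlock(64, 128), DownBlock(128, 256)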

In our work, we relied on qualitative assessment of the generated planes by experts who are familiar with the modality. They were able to provide comments (see C3 in the supplementary materials). While they commented on a couple of other artefacts, the quality of the extrapolation model itself did not seem to be a major issue.


