
Authors

Mou-Cheng Xu, Yukun Zhou, Chen Jin, Marius de Groot, Daniel C. Alexander, Neil P. Oxtoby, Yipeng Hu, Joseph Jacob

Abstract

This paper concerns pseudo labelling in segmentation. Our contribution is fourfold. Firstly, we present a new formulation of pseudo-labelling as an Expectation-Maximization (EM) algorithm for clear statistical interpretation. Secondly, we propose a semi-supervised medical image segmentation method purely based on the original pseudo labelling, namely SegPL. We demonstrate that SegPL is competitive with state-of-the-art consistency-regularisation-based methods for semi-supervised segmentation on a 2D multi-class MRI brain tumour segmentation task and a 3D binary CT lung vessel segmentation task. The simplicity of SegPL results in a lower computational cost compared to prior methods. Thirdly, we demonstrate that the effectiveness of SegPL may originate from its robustness against out-of-distribution noise and adversarial attacks. Lastly, under the EM framework, we introduce a probabilistic generalisation of SegPL via variational inference, which learns a dynamic threshold for pseudo labelling during training. We show that SegPL with variational inference can perform uncertainty estimation on par with the gold-standard method Deep Ensemble.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_56

SharedIt: https://rdcu.be/cVRza

Link to the code repository

https://github.com/moucheng2017/EMSSL

Link to the dataset(s)

https://arteryvein.grand-challenge.org/Home/

https://www.med.upenn.edu/sbia/brats2018/data.html


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors present a semi-supervised segmentation method based on pseudo labels of the unlabelled images.

    • The authors present an approach in which the model and the pseudo labels are updated iteratively: in the E step the pseudo labels are generated, and in the M step the model is updated using both the labels and the pseudo labels.

    • The Dice loss used by the authors requires thresholding of the pseudo labels, so the authors propose a variational-inference-based approach to compute the threshold T.
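
The E step / M step loop summarised above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weighting `alpha` on the unlabelled term and the fixed threshold of 0.5 are assumptions made for illustration, and the model update itself (one optimiser step on the returned loss) is left out.

```python
# Minimal sketch of pseudo labelling viewed as EM for binary segmentation.
# All names and the `alpha` weighting are hypothetical, for illustration only.

def e_step(probs, threshold=0.5):
    """E step: threshold foreground probabilities into hard pseudo labels."""
    return [1.0 if p > threshold else 0.0 for p in probs]


def dice_loss(probs, targets, eps=1e-8):
    """Soft Dice loss between predicted probabilities and binary targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def segpl_style_loss(probs_labelled, labels, probs_unlabelled,
                     alpha=1.0, threshold=0.5):
    """Objective minimised in the M step: supervised Dice loss on labelled
    pixels plus a weighted Dice loss against the pseudo labels."""
    pseudo_labels = e_step(probs_unlabelled, threshold)  # treated as fixed targets
    supervised = dice_loss(probs_labelled, labels)
    unsupervised = dice_loss(probs_unlabelled, pseudo_labels)
    return supervised + alpha * unsupervised


# Example: two labelled pixels and two unlabelled pixels.
example_loss = segpl_style_loss([1.0, 0.0], [1.0, 0.0], [0.9, 0.1])
```

In a real training loop the probabilities would come from the segmentation network, and the threshold would be either fixed or, in the variational variant described by the reviewer, predicted during training.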

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of optimizing the threshold parameter during training is really interesting. Since the threshold parameter is the output of the network, different pseudo labels can have different threshold values for use in the next iteration. This also eliminates the hurdle of manually tuning the hyperparameter. This approach can be used for computing the segmentation uncertainty and is also shown to be robust against adversarial attacks.
    • The experiments and results are extensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors have not mentioned how many EM iterations are performed in the experiments.
    • While SegPL significantly outperforms existing semi-supervised approaches, the SegPL-VI method gave only minor improvements over SegPL.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors have provided necessary information to be able to reproduce this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Please mention the number of EM iterations used. How does the average value of T change over the EM iterations?
    • In Equation 8, the number 50 seems to be a typo. Do you mean 0.5?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of threshold optimization using variational inference for pseudo label based semi-supervised training is novel. This method can also be used to compute segmentation uncertainty.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work presents a new formulation of pseudo-labelling as an Expectation-Maximization (EM) algorithm for clear statistical interpretation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work proposes a new formulation of pseudo-labelling as an Expectation-Maximization (EM) algorithm for clear statistical interpretation. The authors further introduce a probabilistic extension of SegPL using variational inference, which learns a dynamic threshold during training. The two formulations are novel and may provide some new insights in the future.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The idea is novel and interesting, but some descriptions are not precise enough. In addition, the experimental datasets are too small, and the experimental results differ from many previous works (the reported baseline and comparison results are too poor). (1) To the best of our knowledge, there are already many pseudo-label-based semi-supervised segmentation methods; this work is not the first one! (2) The two datasets are too small. Why not use the whole BraTS dataset? The results on BraTS are also too poor to be convincing or reasonable. The CARVE dataset has just 10 volumes; why not use a larger dataset? There are many large-scale, open-access datasets. (3) If possible, please provide results for distance-based metrics, such as HD or ASD. Statistical analysis results should also be reported. (4) The paper misses many recent works on the same topic [1,2,3,4,5, ……].

    [1] Semi-supervised learning for network-based cardiac MR image segmentation, in MICCAI 2017. [2] Semi-Supervised Segmentation of Radiation-Induced Pulmonary Fibrosis from Lung CT Scans with Multi-Scale Guided Dense Attention, in TMI 2021. [3] Efficient Semi-supervised Gross Target Volume of Nasopharyngeal Carcinoma Segmentation via Uncertainty Rectified Pyramid Consistency, in MICCAI 2021. [4] Semi-supervised Left Atrium Segmentation with Mutual Consistency Training, in MICCAI 2021. [5] Semi-supervised Medical Image Segmentation through Dual-task Consistency, in AAAI 2021.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Maybe.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See the weakness comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes a new formulation of pseudo-labelling as an Expectation-Maximization (EM) algorithm for clear statistical interpretation. The authors further introduce a probabilistic extension of SegPL using variational inference, which learns a dynamic threshold during training. The two formulations are novel and may provide some new insights in the future.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a novel semi-supervised segmentation method using pseudo-labels with an extension to variational inference. The method jointly trains a neural network on a few labeled images and a larger number of unlabeled images in a two-step approach. The authors link pseudo-labeling to the expectation-maximization framework. The extended method uses variational Bayesian inference to also estimate the label threshold used for generating pseudo-labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written, the method is novel and it is empirically shown that the method is effective. The link between EM and pseudo-labeling is interesting. The method is compared to different consistency-based baselines and yields better results. Moreover, SegPL is robust towards distribution-shift and adversarial attacks. The paper addresses an important problem, as expert annotations are especially costly in medical imaging.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No major weaknesses. I only have some minor comments (see below).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is described well enough to be reproduced. The paper uses publicly available datasets and the code will be published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Minor comments:

    • If T should be between 0 and 1, why not use a Beta prior and/or Beta likelihood? Using a Normal distribution as variational distribution can lead to values outside [0, 1]. How do you deal with invalid values?
    • I appreciate the Bland-Altman plot. However, I did not know that statistical significance can be derived from it. Please elaborate on that or use a proper statistical test and report p-values.
    • Tab. 1 and 2: I assume that the values are mean ± std. Please state that.
    • Uncertainty estimation with SegPL-VI: As far as I understand, only the estimation of T is implemented in a probabilistic/variational manner. I find it quite a stretch to argue that the deterministic segmentation network can produce reasonable stochastic segmentations with that.
    • The problem with Brier score as calibration metric is that it heavily depends on model accuracy. I think that a proper calibration metric such as (classwise) ECE or adaptive calibration error would be better suited.
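
The last comment, on the Brier score versus a calibration metric such as ECE, can be made concrete with a small sketch. It assumes binary labels, a standard equal-width-bin ECE over prediction confidence, and hypothetical function names; it is not taken from the paper or its code. The example data are chosen so that the predictions are calibrated in aggregate (80% confidence, 80% accuracy) yet the Brier score is still non-zero.

```python
# Sketch comparing the Brier score with expected calibration error (ECE)
# for binary predictions. Function names are hypothetical.

def brier_score(probs, labels):
    """Mean squared error between foreground probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins: the weighted average of
    |accuracy - mean confidence| over the non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        prediction = 1 if p >= 0.5 else 0
        confidence = p if prediction == 1 else 1.0 - p
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, 1.0 if prediction == y else 0.0))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            mean_confidence = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_confidence)
    return ece


# Five pixels predicted as foreground with probability 0.8; four are correct.
probs = [0.8, 0.8, 0.8, 0.8, 0.8]
labels = [1, 1, 1, 1, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Here the confidence matches the accuracy, so the ECE is essentially zero, while the Brier score stays positive because it also penalises the one misclassified pixel.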
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is the best paper in my batch and I really enjoyed reading it. The method is novel and well described, and the results are good. It will be a nice contribution to MICCAI.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers noted that the paper introduces a novel idea and demonstrates its usefulness with sufficient experiments. The concerns raised were relatively minor, such as a lack of clarity in certain places, which the authors should be able to address. Not using the whole BraTS dataset was also considered a weakness. All reviewers voted to accept the paper and I am in agreement.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers for their constructive comments. We appreciate the recognition of the methodological novelty (R1, R2, R3), extensive good results (R1, R3) and “interesting new formulation” (R2, R3). We especially thank R3 for the very kind comment “I really enjoyed reading it”. We will update our manuscript according to the reviewers’ suggestions, e.g. we will add more related works as suggested by R2. We respond to each reviewer in turn below.

To R1: 5.1 and 8.1: Each EM update corresponds to one step of the SGD optimisation. We trained on BraTS with 200 EM steps and on CARVE with 800 EM steps. We would also like to point the reviewer to the Appendix for the details of the training hyperparameters.

5.2. SegPL-VI could potentially improve further if a more extensive hyperparameter search were performed. Additionally, different priors could be explored to further improve the results.

8.2. We thank the reviewer for pointing out the typo; we will update the manuscript accordingly.

To R2: 5.1 and 5.4. We thank the reviewer for pointing us to more related works. We are the first to use and focus on the ORIGINAL pseudo labelling, without any post-processing, for semi-supervised segmentation of medical images. In contrast, the works mentioned by the reviewer do not focus on pseudo labelling; they use pseudo labelling as an auxiliary PART of their systems. Additionally, these works use processed pseudo labels that differ from the original formulation of pseudo labelling. We will update our manuscript to include the suggested related works and highlight the differences between our work and the related works.

5.2. We thank the reviewer for the suggestion of using more diverse and larger datasets; we will include more datasets in our future follow-up submissions.

5.3. We provided a Bland-Altman plot in Figure 2. We will also update our manuscript to add statistical analysis and p-values.

To R3: 8.1. The use of a Beta prior is a really good suggestion and we will investigate it in future work; other possible priors, such as categorical priors, might also be explored. The reviewer is indeed correct about the risk of the learnt threshold overshooting the 0-1 range. In practice, we did not use a Normal distribution with 0 mean and 1 std as in the vanilla VAE. Instead, we used a Normal distribution with 0.4 or 0.5 mean and 0.1 std as the prior to avoid overshooting of the learnt threshold. However, even with a prior of 0.5 mean and 0.1 std, the learnt threshold could fall outside the 0-1 range in the first steps of the optimisation due to random initialisation. Fortunately, the learnt threshold normally converges and stays within the range once the training is stable. We thank the reviewer for raising this question and we will update the manuscript to reflect this discussion; if there is not enough space in the paper, we will make sure to put this in the GitHub readme file.
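
As a rough illustration of why a narrow Normal prior centred at 0.5 rarely produces invalid thresholds, the sketch below computes the prior probability mass outside [0, 1] using Python's standard-library `statistics.NormalDist`. The helper name is hypothetical, and the calculation concerns the prior only; it says nothing about the learnt posterior during early training, which the response above notes can still overshoot.

```python
# Prior mass outside the valid threshold range [0, 1] for two Normal priors.
# `prob_outside_unit_interval` is a hypothetical helper for illustration.
from statistics import NormalDist


def prob_outside_unit_interval(mean, std):
    """Probability that a Normal(mean, std) sample falls below 0 or above 1."""
    dist = NormalDist(mu=mean, sigma=std)
    return dist.cdf(0.0) + (1.0 - dist.cdf(1.0))


# Prior described in the response: mean 0.5, std 0.1 (both bounds are 5 std away).
p_narrow = prob_outside_unit_interval(0.5, 0.1)

# Standard Normal prior as in a vanilla VAE: mean 0, std 1.
p_standard = prob_outside_unit_interval(0.0, 1.0)
```

Under these assumptions the narrow prior places well under one millionth of its mass outside [0, 1], whereas the standard Normal places roughly two thirds of its mass there.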

8.2. The Bland-Altman plot shows that our method outperforms the baselines on most of the testing images, so we expected our method to be statistically better. However, we will include a proper statistical analysis and report p-values in the updated manuscript.

8.4. We agree with the reviewer that if a segmentation model is trained deterministically, it would not capture the posterior because of posterior collapse, in which the posterior can simply shrink to a delta function. However, our segmentation model is trained on a combination of labelled and unlabelled data, where the unlabelled data part is trained stochastically and the labelled data part is trained deterministically. In addition, the threshold model is detached from the segmentation model to avoid posterior collapse.

8.5. We thank the reviewer for pointing out the limitation of the uncertainty metric Brier score. We will investigate the use of ECE and adaptive calibration error and include them in the updated manuscript.


