
Authors

Jinyi Xiang, Peng Qiu, Yang Yang

Abstract

In recent years, various semi-supervised learning (SSL) methods have been developed to deal with the scarcity of labeled data in medical image segmentation. In particular, many of them focus on the uncertainty caused by a lack of knowledge (about the best model), i.e. epistemic uncertainty (EU). Besides EU, another type of uncertainty, aleatoric uncertainty (AU), originating from irreducible errors or noise, also commonly exists in medical imaging data. While previous SSL approaches focus on only one of them (mostly EU), this study shows that SSL segmentation models can benefit more by considering both sources of uncertainty. The proposed FUSSNet framework features a joint learning scheme that combines EU-guided unsupervised learning and AU-guided supervised learning. We assess the method on two benchmark datasets for the segmentation of the left atrium and the pancreas, respectively. The experimental results show that FUSSNet outperforms the state-of-the-art semi-supervised segmentation methods by large margins.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_46

SharedIt: https://rdcu.be/cVVp1

Link to the code repository

https://github.com/grant-jpg/FUSSNet

Link to the dataset(s)

Pancreas dataset: https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT

Left atrium dataset: http://atriaseg2018.cardiacatlas.org


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper describes a framework for semi-supervised deep learning-based segmentation that incorporates both aleatoric and epistemic uncertainty. Epistemic uncertainty is used to divide the image into “certain” and “uncertain” areas, which are handled differently in the semi-supervised learning approach. The aleatoric uncertainty is used in the supervised part of the loss.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I like the main idea of using both aleatoric and epistemic uncertainty in the semi-supervised learning framework

    The comparative validation and ablation study are generally good (but see concerns below)

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of clear statement of contributions in context of most closely related work

    Details of hyperparameter optimisation are unclear

    There is no statistical testing for significance of results

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall good, code has been made available. But details of how the optimal hyperparameters were arrived at are lacking (see below).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    There were a number of aspects of the proposed framework that I liked. It is interesting to investigate the effects of using both aleatoric and epistemic uncertainty. I also like the way that epistemic certain/uncertain regions are treated differently by the semi-supervised learning. I think there is a novel methodological contribution in this paper, but I don’t think it came across that clearly in the authors’ summary of their contributions at the end of the Introduction. E.g. they highlight the different treatment of certain/uncertain regions as a contribution, but in Section 2.3 they mention that [14,17,20] have already done something similar. It would be useful to clearly state the authors’ contributions in the context of the most closely related work. Currently it is difficult to do this in the Introduction because a lot of the most relevant work is only discussed/cited in the Methods (i.e. Sections 2.3/2.4). I think the paper could be organised better by having a Related Work section to more thoroughly review the literature, then the authors’ contributions could be more clearly stated in this context.

    I also have some concerns about hyperparameter optimisation. There are a number of hyperparameters in the proposed framework but there is no information about how they were chosen, what data were used for optimisation and (sometimes) even what the final values were (e.g. what was the value of lambda in Eq 1?). This is all important information to help the reader interpret the results. E.g. for the left atrium (LA) dataset there is mention of training and validation sets but no test set. Does this mean that the validation set was used for hyperparameter optimisation, and that these are the results reported? Or did they use the 54 cases of the challenge test set at http://atriaseg2018.cardiacatlas.org/? This was not mentioned in the paper. Finally, what hyperparameter optimisation was performed for the comparative validation models and the ablation study experiments?

    Some of the differences in performance, especially on the LA dataset, seem to be quite small, and I believe that the test (or validation) set contained only 20 cases. Could some statistical significance testing be included? Also, I presume the results presented are for a single run of all models? What would happen if the authors simply reran all their experiments with a different random seed – would they still see the same differences?

    Other minor comments:

    • Can the text in Fig 1b be made bigger? It is quite hard to read at the moment.
    • P4: “Thus, we do not enforce consistency constraint in certain areas …” I know what the authors are trying to say but I think this choice of words is misleading – it sounds like they are saying that they do not treat different areas differently, which is directly contradicted by the following Eq 1. I would suggest something like: “Thus, we do not enforce the consistency constraint in areas considered to have low epistemic uncertainty …”
    • It would be interesting to see some examples of the epistemic certain/uncertain areas. E.g. is this just highlighting boundary regions as uncertain? I understand that the authors are constrained by the page limit but some insight into this would be useful, even if it were just a sentence in the Discussion.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My main two concerns were the lack of a clear statement of contributions, and the lack of clarity about aspects of the validation, especially hyperparameter optimisation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I think the authors have done a reasonable job with their rebuttal, even responding to the concerns of myself and R3 on doing multiple runs. But I still have concerns about the statement of novelty with regard to previous work. In their rebuttal the authors make such a statement quite well (better than in the submitted paper). But my concern is that most of the papers they mention in this statement are not currently cited in the introduction of the paper - they are only cited in the Methods. Therefore I do not see how they could include such a statement in the paper without a significant revision, i.e. moving much of the discussion of prior work from the Methods (Sections 2.3/2.4) into the Introduction. In my opinion, this type of revision is beyond what is expected of a post-rebuttal revision.

    Therefore, on balance, although I do like aspects of the paper, I don’t think it can be published as submitted and I stick with my recommendation of Weak Reject.



Review #2

  • Please describe the contribution of the paper

    This paper is built on a joint training regime of epistemic and aleatoric uncertainty. Particularly, the authors computed the epistemic uncertainty from four different classifiers on top of a shared embedding and used the thresholded uncertainty as a mask for computing the loss between the student and mean-teacher networks. They have shown experiments on two datasets with a marginal performance gain over previous methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The evaluation is detailed on two different datasets showing slight improvement over previous methods, which has clinical relevance.
    2. The novelty of this paper is limited, as explained below. However, the ablation regarding PL and CR is interesting and can be helpful to other student-teacher unsupervised learning settings.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The technical contributions are a bit weak. This paper combines the existing idea of an uncertainty-based mask [14] with an extra aleatoric loss function [11] which is a trivial extension.
    2. Enforcing the consistency loss for the EU regions of the image between teacher and student prediction seems empirical and not well supported. One can argue that teacher and student networks can have the same EU, and thus consistency loss will not be helpful in such a scenario.
    3. Inclusion of aleatoric uncertainty with epistemic uncertainty did not improve much over epistemic one, and there is no statistical test reported on the contribution coming from aleatoric uncertainty. This raises the question: how much weight does one need to provide between them in the final loss function? Is this something dataset-specific?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The details about the experimental setup are thorough. And the authors have promised to release the code upon acceptance, which is good for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. On page 4, the authors said, “The CE loss and focal loss focus more on pixel-level features, while Dice loss and IoU loss care more about shape information.” This is not true; Dice and IoU do not care more about shape; they are still pixel-level losses. This sentence and the following ones need correction.
    2. Eq. (3): the softmax would be applied over the channel dimension, right? If so, the index j should be outside the softmax function, i.e. softmax(\eta_i^(t))_j (see the notation spelled out after this list).
    3. Unnecessary acronyms and symbols can be avoided for better readability, e.g., MIS, EMA. Also, please introduce ASD and HD before using those acronyms.
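
    To make the notational point in item 2 concrete, one possible reading (a sketch only, reusing the paper's symbols from Eq. (3); the exact formula in the paper may differ) is

        \mathrm{softmax}(\eta_i^{(t)})_j = \frac{\exp(\eta_{i,j}^{(t)})}{\sum_k \exp(\eta_{i,k}^{(t)})},

    where the softmax normalizes over the channel (class) dimension indexed by k, and the subscript j then selects the probability of class j for pixel i.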
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper lacks technical contribution; however, it has potential room for discussion, since this simple idea has produced good results. Hence, I recommend weak acceptance in the initial round.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors addressed some of my concerns in the rebuttal.



Review #3

  • Please describe the contribution of the paper

    The paper proposes an iterative multi-step training strategy called “FUSSNet” for medical image segmentation tasks consisting of unsupervised and supervised learning aspects. In different steps, epistemic and aleatoric uncertainty is exploited to boost segmentation performance according to multiple metrics (pixel-level and global-level). The authors claim superiority to other recent semi-supervised methods on two challenging segmentation tasks (on MRI and CT) in the small data regime (only 12 or 16 labeled datasets used per task).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper addresses uncertainty at various levels, which contributes to an important trend that emerged over recent years in the MICCAI community. The way uncertainty is exploited here in the context of anatomical segmentation appears meaningful to allow learning useful features from both labeled and unlabeled data, in particular in challenging regions of the image (organ borders, noisy regions, imaging artifacts, etc.). In the future, this framework may even be worth extending to potentially include an active learning component to allow for a dynamic increase of meaningful labeled data to further boost its performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Evaluation is not fully convincing, especially given the sometimes small margins for various metrics between competing methods. The main reason is that the paper uses small datasets only (~100 volumes per task, of which only 12 or 16 labeled ones were used) and at the same time reports only the results of a single training run per task/method. To minimize the risk of random lucky results, the authors should re-run the same experiments multiple times with different random network weights initialization, and more importantly, different random splits of training datasets (at least shuffle the labeled and unlabeled training datasets randomly). Then the distribution of the resulting metrics should be reported and compared to state of the art (e.g. using mean and standard deviation, if applicable). Is FUSSNet still coming out on top for both tasks? Furthermore, were other tasks tested and are the authors aware of any limitations of their approach?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Authors use public datasets for benchmarking and employ a widely used network architecture, etc., all aiding reproducibility. Moreover, the code will be made available. However, it is not clear if the code will also provide information about the data split (train / validation and train with vs. without label for the different steps). This would not be an issue if the authors reported results based on multiple random splits of the same data instead of single data points (e.g. reporting mean +- std.dev. of reported metrics over e.g. n=5 training runs of the full pipeline).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Abstract: “… outperforms the state-of-the-art … by large margins” -> consider a more quantitative statement such as “by up to XX% DICE” or similar.
    • Fig 1: some text is too small to read, especially some boxes on Fig 1b; please enlarge
    • Fig 1a and text indicate an iterative nature of the framework. The paper is lacking convergence criteria and a discussion of the added complexity (runtime, memory) for training, especially in light of the comments regarding other methods' computational cost issues (MC sampling)
    • Fig 1a: not clear what pretraining entails (first box in Fig 1a): what is the training objective, what data is used, etc. I suppose it is a fully-supervised training of the V-Net directly on the segmentation task using the selected training datasets that are also used later in the training? Please clarify.
    • page 4: “four classifiers share the encoder but differ in loss functions”: what about the decoder? Why is only one forward pass required (or only one forward pass for encoder, but four for different decoders (?))? This part needs clarification
    • page 5, eq. 4: What is the value of \omega? And right below: “weighted sum of …”: how are L_sup and L_unsup weighted against each other? Are these hyperparameters difficult to tune, and were the same values used for both tasks?
    • The paper highlights several times that AU is modeled in logit space, but it is not well motivated why and if it helps

    • More details on experiment section
    • FUSSNet appears relatively complex to train. Therefore it would be great to compare not only final performance metrics, but also time until convergence to those results for each method
    • it would be interesting to see the impact of increasing the number of labeled datasets in training (or changing the ratio of labeled vs. unlabeled training datasets)
    • page 5: Unlabeled data are not truly unlabeled, because the data are preprocessed (cropped) based on labels.
    • page 6: sliding window patches are very small (16x16x4 pixels) -> can the authors motivate why? Is it a memory limitation, or does a small window provide a benefit in any way? Was the same setting used for the comparison methods? Maybe this very small context explains the many disconnected regions in some thumbnails in Fig 2.
    • “theoretical upper bound” (used to describe performance with a standard training strategy using all training data) -> this term is not adequate here. Please use “fully-supervised upper bound”, “fully-supervised reference result”, or similar.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Focus on uncertainty is one of the strongest points of the paper, however the method is relatively complex, while gains over other methods may not be overly significant. Also, one of the main claims/novelties, the one regarding aleatoric uncertainty, turns out to provide only minimal benefit.

    My main criticism is that the evaluation is not 100% convincing given that only single-experiment results are being reported (instead of mean+-std.dev. over multiple randomized runs) using a very small number of training datasets. Especially in this low-data regime, the random selection of training datasets can have a huge impact. But since the results of FUSSNet are somewhat outperforming all other methods in both experiments, assuming the authors have not hand-picked results, I am eager to learn more details about their work at MICCAI.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposed a semi-supervised learning framework for image segmentation by incorporating both aleatoric and epistemic uncertainties. It’s interesting to see the exploration of fusing two different sources of uncertainty, and the experimental evaluation on two datasets shows improvement. However, major concerns have been raised consistently by the reviewers, including weak technical novelty and the lack of a clear statement of contributions relative to existing studies, concerns about the experimental setting regarding hyperparameter optimization and fair evaluation (e.g., small datasets, a single training run), and marginal improvements without statistical significance testing. The authors are invited to address these questions in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

Q1. Weak technical novelty. We enhance uncertainty-aware learning [2,3,8,20,22] by bringing aleatoric uncertainty (AU) into semi-supervised learning (SSL). Most previous studies in SSL only focused on epistemic uncertainty (EU). As far as we know, FUSSNet is the first attempt to consider both EU and AU in semi-supervised medical image segmentation. FUSSNet improves the computational efficiency of both the EU- and AU-guided learning parts, and increases overall accuracy.
1) For EU-guided training, we replace the time-consuming Monte Carlo dropout with an ensemble of decoders. Compared to [14], we have a different EU assessment module. [14] used 2 extra classifiers with differently weighted CE loss (3 in total), while we use 4 decoders with different loss functions, resulting in more diverse predictions. The ablation results suggest that our EU assessment strategy is more effective than that of [14] (the EU part of FUSSNet outperforms [14] in Table 3).
2) We propose a new loss based on [11], which reduces the computation cost and speeds up convergence. The supplementary material provides the proof.
3) The proposed framework models two sources of uncertainty in a systematic way and has great potential for extension to active learning or interactive segmentation scenarios.
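
To make the EU-guided mechanism above easier to follow, here is a rough, hypothetical PyTorch sketch of one way such a scheme can be implemented: several classifier heads share one encoder, their pixel-wise disagreement acts as the EU estimate, and the resulting certain/uncertain split decides where pseudo-label supervision versus teacher-student consistency is applied. The function name, the agreement threshold, and the specific loss forms are assumptions for illustration, not the released implementation (see the linked repository for the actual code).

    import torch
    import torch.nn.functional as F

    def eu_guided_unsup_loss(head_logits, teacher_probs, agree_thresh=0.75):
        # head_logits:   list of 4 tensors [B, C, D, H, W], one per classifier head
        #                sharing the same encoder (stand-in for FUSSNet's ensemble).
        # teacher_probs: [B, C, D, H, W] softmax output of the mean-teacher (EMA) model.
        # agree_thresh:  agreement level above which a voxel is treated as "certain";
        #                the paper's actual EU thresholding rule may differ.
        probs = [F.softmax(lg, dim=1) for lg in head_logits]
        labels = torch.stack([p.argmax(dim=1) for p in probs], dim=0)    # [4, B, D, H, W]

        # Epistemic-uncertainty proxy: fraction of heads agreeing with the majority vote.
        consensus, _ = torch.mode(labels, dim=0)                         # [B, D, H, W]
        agreement = (labels == consensus.unsqueeze(0)).float().mean(0)   # in [0, 1]
        certain = (agreement >= agree_thresh).float()                    # 1 = low EU
        uncertain = 1.0 - certain

        mean_prob = torch.stack(probs, dim=0).mean(0)                    # ensemble prediction

        # Certain voxels: pseudo-label supervision with the consensus label.
        pl = F.nll_loss(torch.log(mean_prob + 1e-8), consensus, reduction="none")
        pl_loss = (pl * certain).sum() / (certain.sum() + 1e-8)

        # Uncertain voxels: consistency between student ensemble and mean teacher.
        cons = ((mean_prob - teacher_probs) ** 2).mean(dim=1)            # per-voxel MSE
        cons_loss = (cons * uncertain).sum() / (uncertain.sum() + 1e-8)

        # The rebuttal states lambda in Eq. (1) was set to 1, so the terms are simply summed.
        return pl_loss + cons_loss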

Q2. Concerns on data split, small data, and hyperparameters.
1) We use the same data split as previous works [2,3,6,8,14,18,20,22]. All baseline methods and ours used the same images with a fixed train-test split, thus we did not report shuffled-data results. Previous studies split the left atrium (LA) and Pancreas data into train and test sets without an explicit validation set (as described in the Dataset section of [8]).
2) The 2 datasets are widely used benchmarks. The SOTAs [2,3,6,18,20] used only LA and [8,22] used both sets. We will apply FUSSNet to large-scale and more ambiguous real-world data in our future work.
3) The hyperparameters, like lambda in Eq. (1) and the weights for L_sup and L_unsup, are all set to 1 in all experiments, except for omega in Eq. (4), which is data-specific. As omega controls the AU-guided training loss, empirically we set a larger value for data with more ambiguity. As can be seen in the data, pancreas CTs have clear boundaries while the LA data are blurrier (Fig. 2). We randomly pick 2 labeled samples as a validation set to decide the specific value of omega, and then retrain the model using all labeled samples (omega=0.1 for pancreas and 0.8 for LA).
4) [R3] We use the same backbone model and sliding window patch size as [2,6,8,14,18,20]. The pretraining is fully supervised learning using the labeled data. Then the labeled data is used only in AU-guided training.

Q3. Marginal improvement on LA and no statistical significance test. For LA, our performance (with only 20% labeled training data) is almost as good as the fully-supervised upper bound. Besides, as LA contains more scans, it is reasonable to see less improvement compared to the pancreas segmentation. We ran our model 10 times and here are the mean (std dev) results. Pancreas: Dice 81.51 (0.242), Jaccard 69.34 (0.317), ASD 1.68 (0.142), 95HD 5.90 (0.421). LA: Dice 90.99 (0.115), Jaccard 83.50 (0.224), ASD 1.78 (0.163), 95HD 5.57 (0.240). FUSSNet still has the overall best performance on both datasets, and has an obvious advantage on Pancreas. We also ran the best SOTA ([14]) on the Pancreas dataset 10 times, and computed the statistical significance of the performance differences using a paired t-test. Here are [14]'s results in the form of mean (std dev / p-value): Dice 79.485 (0.246 / p<0.0001), Jaccard 66.52 (0.296 / p<0.0001), ASD 2.83 (0.719 / p=0.0235), 95HD 11.49 (2.13 / p=0.002). All results show a statistically significant difference at the 95% level. (The best SOTA for LA ([18]) has no accessible code.)
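
For context, a minimal sketch of how such a paired comparison can be computed is shown below. It assumes SciPy is available and that run i of each method used the same seed and data split (the rebuttal does not state how runs were paired); the per-run Dice values are not reported, so none are hard-coded here.

    import numpy as np
    from scipy import stats

    def compare_runs(method_a_dice, method_b_dice):
        # Paired t-test over per-run Dice scores from repeated trainings.
        # Pairing assumes run i of both methods shares the same seed / data split.
        a = np.asarray(method_a_dice, dtype=float)
        b = np.asarray(method_b_dice, dtype=float)
        t_stat, p_value = stats.ttest_rel(a, b)
        return t_stat, p_value

    # Usage (with the 10 per-run Dice scores of FUSSNet and of [14] on Pancreas):
    # t, p = compare_runs(fussnet_dice_runs, baseline14_dice_runs)
    # print(f"t = {t:.3f}, p = {p:.4g}")   # p < 0.05 -> significant at the 95% level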

Q4. AU's contribution is small (R2). The contribution of AU depends on the ambiguity in the data. AU improves more on LA than on Pancreas, i.e. Dice: 1.42, Jaccard: 2.39, ASD: 0.66, 95HD: 2.08. Note that the two benchmark sets are of relatively high quality; AU may contribute more to real-world data.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed most of the concerns in the rebuttal; thus, I favor acceptance. Please take the comments and suggestions from the reviewers into consideration in the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Trying to address the uncertainty problem is a plus; the empirical results are a bit lacking.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies the problem of semi-supervised image segmentation using a neural network to make use of both aleatoric uncertainty and epistemic uncertainty. While the studied problem is apparently important for medical image computing and the paper is generally well written, the presented idea is a pretty straightforward combination of aleatoric and epistemic uncertainties. Considering that both aleatoric uncertainty and epistemic uncertainty have already been studied in multiple existing papers, the novelty of this paper is limited. I agree with the post-rebuttal assessment by Reviewer #1 in that the current form still needs a significant revision before it can be published.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4


