
Authors

Yixuan Wu, Bo Zheng, Jintai Chen, Danny Z. Chen, Jian Wu

Abstract

As deep learning methods continue to improve medical image segmentation performance, data annotation remains a major bottleneck due to the labor-intensive and time-consuming burden placed on medical experts, especially for 3D images. To significantly reduce annotation effort while attaining competitive segmentation accuracy, we propose a self-learning and one-shot learning based framework for 3D medical image segmentation that requires annotating only one slice of each 3D image. Our approach takes two steps: (1) self-learning of a reconstruction network to learn semantic correspondence among 2D slices within 3D images, and (2) representative selection of single slices for one-shot manual annotation and propagation of the annotations with the well-trained reconstruction network. Extensive experiments verify that our new framework achieves performance comparable to fully supervised methods with less than 1% of the data annotated and generalizes well on several out-of-distribution testing sets.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_24

SharedIt: https://rdcu.be/cVRY6

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a new method for 3D image segmentation that requires the annotation of only one slice per volume, addressing the problem of training data availability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is very interesting, addressing an important problem in medical image segmentation. The work is well motivated, and the empirical results are convincing. A good number of baseline / alternative methods have been used for comparison and an ablation study provides insights about the importance of individual components. Four datasets/applications have been used for evaluation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could be a little clearer at times; terms such as "one-shot" are used in a possibly unusual way without being precisely defined.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Not assessed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • It would be good to clearly define terms such as "one-shot", as this is not very clear and the term's use seems slightly different from that in other works.

    • I would suggest replacing "inference" with "test-time" throughout the paper, as inference is something slightly different from what the authors mean here. The use of the term "inference sets" is misleading and should be changed to "testing sets" or "evaluation sets".

    • The introduction may need a reference or sentence to explain what ‘human-machine disharmony’ means.

    • It is unclear what the authors mean by "enormous semantics" in the Method section.

    • From the definition, it is unclear whether the authors assume all volumes to have the same number of slices D.

    • I would suggest to rename ‘Featuring Module’ to ‘Feature Extraction Module’, and ‘Reconstructing Module’ to ‘Reconstruction Module’.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a good paper with an interesting method and thorough evaluation.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose to learn image features that allow (1) a "representative" slice to be extracted from a 3D volume and (2) manual annotations to be propagated from that slice to the rest of the volume. First, to find the single slice to label, the method clusters the slices (K-means clustering) and selects the most representative slice of each cluster as the one with the maximum summed cosine similarity between its learned features and those of the other slices in the cluster. Second, to propagate labels from slice to slice, the method weights the contribution of each pixel in the already-labeled slice by the similarity of its features to those of the pixel in the unlabeled slice. The same features serve both steps and are learned through the ability to reconstruct one slice from another, where the weighting is again based on similarity in feature space.
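
    For concreteness, a minimal sketch (not the authors' code) of the slice-selection step described above: cluster per-slice feature vectors with K-means, then pick from each cluster the slice whose features have the highest summed cosine similarity to the other slices in that cluster. The feature extractor, number of clusters, and function names are illustrative assumptions.

# Hypothetical sketch of representative-slice selection via K-means + cosine similarity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def select_representative_slices(slice_features: np.ndarray, n_clusters: int = 1):
    """slice_features: (D, F) array, one learned feature vector per 2D slice."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(slice_features)
    representatives = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sim = cosine_similarity(slice_features[idx])  # pairwise similarities within the cluster
        scores = sim.sum(axis=1) - 1.0                # summed similarity, excluding self-similarity
        representatives.append(idx[int(np.argmax(scores))])
    return sorted(representatives)

# Usage: the returned indices are the slices a human would be asked to annotate,
# e.g. reps = select_representative_slices(features_from_trained_encoder, n_clusters=1)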

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very useful strategy of having users annotate only a single slice - this is a very user-friendly form of interaction.

    • Good comparison to lots of alternative methods for dealing with small annotations and good improvement as measured by Dice score, ASSD, HD, etc.

    • Investigation into domain shift from one dataset to another.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Novelty - in my opinion, most of the proposed method is certainly a sensible solution to the problem, but not particularly novel or thought provoking.

    • Binary segmentation only - the paper appears largely limited to binary segmentations, as the single segmented slice must contain all labels for them to be propagated, which is unlikely in many multiclass segmentation problems. This significant limitation is never discussed.

    • Depends on the interaction between the slice orientation and the anatomy's shape - the proposed method propagates segmentations from slice to slice. Specifically, it learns pixel-wise features that are used, for each pixel in the unsegmented slice, to find the most similar pixels in the previously segmented slice. Such methods can fail if the shape of the object's cross-section changes rapidly from slice to slice. In addition, the authors restrict the matching to pixels within a 13x13 window; for objects that rapidly shrink or expand from slice to slice, there may not be any pixel with the correct label within this small window (a minimal sketch of this windowed matching follows this list).

    • Evaluation on liver/spleen only - Given the constraints above, I would need more convincing that these segmentation tasks are sufficiently non-trivial with respect to both image appearance and object shape to truly demonstrate the utility of the proposed method.
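
    As referenced in the bullet on slice orientation above, here is a minimal sketch (not the authors' implementation) of windowed, similarity-weighted label propagation: each pixel in the unlabeled slice gathers labels from a small window of the labeled slice, weighted by feature similarity. Window size, feature shapes, softmax temperature, and function names are illustrative assumptions.

# Hypothetical sketch of slice-to-slice label propagation within a local window.
import numpy as np

def propagate_labels(feat_labeled, feat_unlabeled, labels, window=13, tau=0.07):
    """feat_*: (H, W, F) pixel-wise features; labels: (H, W) binary mask of the labeled slice."""
    H, W, _ = feat_unlabeled.shape
    r = window // 2
    out = np.zeros((H, W), dtype=float)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, x - r), min(W, x + r + 1)
            q = feat_unlabeled[y, x]                                   # query feature
            keys = feat_labeled[y0:y1, x0:x1].reshape(-1, q.shape[0])  # candidate features in the window
            sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
            w = np.exp(sims / tau)
            w /= w.sum()                                               # softmax over the window
            out[y, x] = (w * labels[y0:y1, x0:x1].reshape(-1)).sum()   # similarity-weighted label
    return out > 0.5  # threshold the soft propagated mask to a binary label

# Note: if the object's cross-section shifts by more than window // 2 pixels between
# adjacent slices, no correctly-labeled pixel falls inside the window -- the failure
# mode the reviewer points out.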

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Satisfactory - even though these items are checked off in the reproducibility response, there are actually no details of how the baseline methods were implemented and used, no variation reported (error bars or standard deviations), and no statistical significance analyses.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Alternative methods in Table 1 - do the results for experiments (3) to (5) come from the papers [40, 16, 37], or did the authors reimplement or run those methods themselves? In particular, I am a bit wary of the pseudo-annotation results for the CT liver dataset, as this does not seem to be an extremely hard problem and the reported result (Dice 0.63) seems very low. I would have expected this method to do better.

    • 3D vs. 2D U-Net in Table 1 - Both use a single annotated slice, and it is unexpected to me that the 3D U-Net does so much worse, given that it could exploit additional contextual information. Is there an explanation, e.g., does it overfit more than the 2D U-Net?

    • CHAOS vs. LiTS liver CT scans - I would think they would be quite similar given that both are CT, but the experiments in Table 2, in which the training and testing datasets differ, indicate that it is difficult to transfer from one to the other. Is the difference due to field of view, contrast enhancement, etc.?

    Additional small comments:

    • Abstract - “our new framework achieves better performance with less than 1% annotated data” - to me this phrasing implies that the proposed framework with less than 1% annotated data outperforms fully annotated 3D U-Net training, which is not what the authors are actually trying to say. Consider rephrasing to something like, “when less than 1% annotated data is available, our new framework achieves better performance than several baselines”.

    • Related work - This method is quite similar to conventional patch-based segmentation along with a learned distance function - the authors could consider adding this to their related work.

    • Introduction - I didn’t understand the motivation to avoid human-machine iterations, as I’d think this would be fine as long as any computations were fast enough.

    • Consider editing the self-learning loss L_sche to incorporate the representative slice pairs from the screening-module training stage (Eqns. 2-4). As written, the loss reads as though the self-learning operates on all pairs of neighboring slices and the representative slice pairs are unused.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factors for my overall score are the lack of novelty and the lack of an explanation convincing me that the baseline methods were implemented correctly yet still perform so badly. But the method addresses a good problem and has nice results, so it could still fit in at MICCAI.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors present an approach to automatically select the best 2D slice of a 3D image to be manually annotated. This annotation is then propagated to the other slices, yielding very good results with little effort.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Clearly written
    • Good visualisation
    • Appropriate references
    • Convincing results
    • Good ablation study
    • Methodology well explained
    • Comparison to relevant work
    • Usage of good datasets

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The values of the weights are not discussed.
    • Relying on one slice with ground truth may indeed influence the results. The scheduled sampling should address this, but some examples would have been nice (supplementary material is missing).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    seems to be ok

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Given the limited amount of space, the paper is self-contained and clear enough. Some more evidence, as noted above, would have been nice.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No missing parts, well written, clear results; the best paper in my stack.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a self-learning and one-shot learning based framework for simplified annotation of 3D medical images for segmentation. The idea behind the work is to learn the correspondence of 2D slices within the 3D volume along with one-shot learning of the annotations in a single 2D slice. The paper is thoroughly validated on four different datasets, achieving convincing empirical results.

    At some points the paper can be unclear, requiring further clarification or the use of more standard terminology. The authors are recommended to address these points as highlighted by the reviewers.

    Finally, please comment on the poor results obtained by the baseline methods evaluated in the paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We would like to thank all the chairs and reviewers for their efforts. The key comments and our responses are summarized below.

R1:
1 - What is "human-machine disharmony" and why? (1) Previous active learning (AL) based methods iteratively conduct two steps: (i) a model selects valuable samples from the unlabeled set; (ii) experts annotate the selected samples. (2) Such a process implies that experts must be readily available for queries in each round, and that the AL process is suspended until the queried samples are annotated. This asynchrony makes AL methods inefficient in various medical scenarios, motivating our one-shot annotation design. We will highlight this in the final version.
2 - The meaning of "enormous semantics". Inside unlabeled medical images, although we cannot know to which specific class each pixel belongs, we can group pixels of the same class by constructing pixel-wise correspondence. Such underlying correspondence can be regarded as semantics, and we argue that it is enough to attain full annotation with only one slice annotated. That is why we describe the semantics as "enormous".
3 - Clarity. We will carefully revise all unclear terms. Thanks for your advice!

R2:
1 - The setting of the window size in computing pixel-level similarity. To reduce computation cost, we compute similarity locally by introducing the window size. We tried different sizes and found that 13x13 worked well on all datasets. In fact, we found it rare for organ appearance to change drastically between adjacent slices, since CT and MRI are densely sampled.
2 - Effects on multiclass segmentation. Thanks for this insightful suggestion. We are currently working on multiclass segmentation tasks and have found that constructing pixel-wise correspondence is also promising in our evaluations.
3 - Poor generalizability of 3D U-Net. 3D U-Net is commonly trained under the assumption that the training and test sets are identically distributed. In this paper, our method is especially suited to out-of-distribution test samples (with different image acquisition protocols, contrast agents, metabolic stages, etc.) compared with the training samples.
4 - Why did 3D U-Net perform worse than 2D U-Net? In this task, only one slice in each training volume was annotated. Thus, it was infeasible to exploit the advantage of 3D U-Net in learning semantic dependence along the depth dimension. In addition, the higher model complexity of 3D U-Net made it hard to train with one-slice supervision. We experimented with multiple-slice annotation, and 3D U-Net performed better.
5 - Results of other compared methods. We ran the comparison methods on our datasets. The pseudo-annotation method requires a larger training set for its 3D network and relies heavily on accurate prediction of spatial transformations. In contrast, our method is more robust due to the two training strategies (see Sec. 2.6).
6 - Clarity. Please refer to R1-1 and R1-3 above. Thanks for your suggestions; we will revise the paper carefully.

R3:
1 - The weights in the objective function. We tried different values and 0.9/0.1 worked well in our setting. We will add the comparison in the final version. Thanks for your suggestions!
2 - The choice of one slice to annotate. We proposed a screening module to select the most representative slice for annotation and adopted scheduled sampling and cycle consistency strategies to enhance the model's robustness. All these components contributed to segmentation accuracy (i.e., 0.19, 0.13, and 0.04 in DICE, respectively; see Sec. 3.4). We will emphasize this in the final version.

To Meta-Reviewer: On the poor performance of 2D and 3D U-Net. 2D U-Net is designed for the fully supervised setting and works well when the labeled training set is sufficient. 3D U-Net may work with sparse annotation (requiring over 10% of slices annotated), but our experiments verify that it cannot handle the one-slice annotation setting well, which is the advantage of our method.


