Authors
Zhe Xu, Jiangpeng Yan, Donghuan Lu, Yixin Wang, Jie Luo, Yefeng Zheng, Raymond Kai-yu Tong
Abstract
Deep learning-based medical image segmentation usually requires abundant high-quality labeled data from experts, yet this is often infeasible in clinical practice. Without sufficient expert-examined labels, supervised approaches often struggle with inferior performance. Unfortunately, directly introducing additional data with low-quality, cheap annotations (e.g., crowdsourcing from non-experts) may confuse the training. To address this, we propose a Prototypical Label Isolation Learning (PLIL) framework to robustly learn left atrium segmentation from scarce high-quality labeled data and massive low-quality labeled data, which enables effective expert-amateur collaboration. In particular, PLIL is built upon the popular teacher-student framework. Exploiting the structural characteristic that semantic regions of the same class are often highly correlated, as well as the higher noise tolerance of the high-level feature space, the self-ensembling teacher model isolates clean and noisy labeled voxels by comparing their relative feature distances to the class prototypes via multi-scale voting. The student then follows the teacher's instruction for adaptive learning, wherein the clean voxels are introduced as supervised signals and the noisy ones are regularized via perturbed stability learning, considering their large intra-class variation. Comprehensive experiments on the left atrium segmentation benchmark demonstrate the superior performance of our approach.
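To make the adaptive scheme described above concrete, here is a minimal sketch of how clean-voxel supervision and noisy-voxel stability regularization could be combined. This is illustrative only, not the authors' released code; all function and variable names are invented, and the exact losses in the paper may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_loss(student_logits, student_logits_perturbed, teacher_logits,
                  lq_labels, clean_mask):
    """Combine clean-voxel supervision with noisy-voxel stability regularization.

    student_logits:           (B, K, D, H, W) student predictions
    student_logits_perturbed: (B, K, D, H, W) student predictions on a
                              perturbed view of the same input
    teacher_logits:           (B, K, D, H, W) teacher predictions
    lq_labels:                (B, D, H, W)   low-quality integer label map
    clean_mask:               (B, D, H, W)   bool, True where the teacher
                                             judged the label to be clean
    """
    # Supervised signal only on voxels flagged as clean.
    ce = F.cross_entropy(student_logits, lq_labels, reduction='none')  # (B,D,H,W)
    sup_loss = (ce * clean_mask.float()).sum() / clean_mask.float().sum().clamp(min=1)

    # Voxels flagged as noisy are never supervised by their labels; instead,
    # their predictions are encouraged to stay stable under perturbation.
    noisy = (~clean_mask).float().unsqueeze(1)                         # (B,1,D,H,W)
    mse = (student_logits_perturbed.softmax(1) - teacher_logits.softmax(1)) ** 2
    stab_loss = (mse * noisy).sum() / noisy.sum().clamp(min=1)

    return sup_loss + stab_loss
```

The key design point is that noisy-flagged voxels contribute no supervised term, only a consistency term, which is what the abstract calls perturbed stability learning.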
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43990-2_10
SharedIt: https://rdcu.be/dnwLk
Link to the code repository
https://github.com/lemoshu/PLIL
Link to the dataset(s)
https://github.com/yulequan/UA-MT/tree/master/data
Reviews
Review #1
- Please describe the contribution of the paper
This work proposes an expert-amateur collaboration method for the left atrium segmentation task. It adopts a teacher-student network with inputs of different quality. The experimental results show that the proposed approach works better than previous methods.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The writing quality is good, and the final results are better than those of previous work.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1) The paper (abstract) claims that only scarce high-quality labeled data and massive low-quality labeled data are used. However, the dataset only contains 100 samples in total; how is 'massive' defined in the abstract? Also, the test set contains only 20 samples, which may not be enough to convincingly demonstrate the performance of the proposed work.
2) Section 2.2 mentions that the last three scales of features from the teacher model are selected for multi-scale voting. It is not clear why only the last three are selected, and there is no experiment showing that this selection is optimal.
3) What if the high-quality labeled images are fed to the teacher model? How will the performance be impacted?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
No code or dataset is provided.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Please address the weakness and questions listed above.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
5
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Learning with noisy, mixed-quality labels is a practical problem, and the paper shows reasonably good results.
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
5
- [Post rebuttal] Please justify your decision
After reading the authors' rebuttal, I think the authors have answered my questions. I vote to accept this paper.
Review #2
- Please describe the contribution of the paper
This paper presents a method to perform left atrium segmentation by utilizing a small amount of high-quality labeled data and massive low-quality annotations. In particular, the authors design a multi-scale voting strategy in the feature space and an adaptive learning scheme to distinguish clean from noisy voxels.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed method does not require a large amount of high-quality annotated data for training, which makes it easier to apply in practice.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I have some questions about the data, methods, and experiments. For methods:
- You set a threshold (η=0.25) on the prediction uncertainty to generate the final pseudo mask. Is this the optimal value? Normally, during training, the uncertainty decreases and approaches 0, so I am not sure whether setting η=0.25 can effectively filter out unreliable predictions. I suggest adding comparison experiments that use different values of η or adopt percentile values as the threshold.
- In your method, you select the last three scales of features from the decoder to perform multi-scale voting. Why do you not extract feature maps from the encoder or select features at other scales?
- In your method, you choose the predictions with low uncertainty as the pseudo mask used to generate prototypes. I understand the reason: low-uncertainty predictions are more likely to classify each voxel correctly. However, your multi-scale voting process can already distinguish clean and noisy annotations, so you could combine those results with the LQ annotations to obtain robust pseudo masks and generate reliable prototypes for training in the next epoch. Why did you not choose this approach?
For data:
- Only one dataset is involved in your experiments. You can add another public or local dataset to enhance the persuasiveness of your method.
For details:
- In Fig. 2(a), what does the white color represent?
- In Fig. 2(b), you give an example of a dilated LQ label and the estimated noisy-label mask. I think you should also give an example of an eroded LQ label.
- You could add some visualization results to explain why you select low-uncertainty predictions as the final pseudo masks for generating prototypes.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
Yes
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
Please refer to my comments in the “Weakness” section.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
4
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty of the proposed method is limited, and some details of the method are unclear. The evaluation is performed on only a single dataset, which is insufficient to demonstrate the effectiveness of the proposed method.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
To overcome the lack of high-quality annotated data, this paper introduces a Prototypical Label Isolation Learning (PLIL) framework to learn robust left atrium segmentation. PLIL is based on the teacher-student framework and is trained on limited high-quality labelled data and abundant low-quality labelled data, which enables effective expert-amateur collaboration. The teacher network uses multi-scale voting to isolate clean and suspected noisy labelled voxels. The student follows the teacher's instruction for adaptive learning: the clean voxels are presented as supervised signals, while the noisy ones are regularized.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper tackles a broad and prevalent problem in the field of medical image analysis. Noise in ground-truth labels is common, and recognising and overcoming it is crucial for developing robust models. Therefore, any improvement can be widely useful.
- The authors did a good job by comparing their work against multiple methods, and by evaluating the statistical significance of the improvement. The paper is well-written and makes effective use of tables and figures.
- The ablation study was done under the 4-high-quality-sample setting, and it revealed that the proposed strategy effectively identifies the clean labelled voxels, as expected.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- In my opinion, certain parts of the methodology section are not easy to understand due to long and complex sentences.
- The authors evaluated the PLIL approach on a single task, left atrium segmentation; additional organs and modalities should also have been considered.
- The presented work evaluates labels corrupted with low-quality noise such as erosion and dilation. Moreover, removing the m-masked supervised loss during the ablation study led to serious confirmation bias, as the PLIL method cannot handle severe noise well.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The authors have described the details of the experimental setup and the implementation. Although the paper uses a public dataset, the source code is not provided; the experiments can therefore only be reproduced if the scripts are made available.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- In the Implementation subsection, why not use the latest state-of-the-art backbones, such as nnU-Net [1] or one of the U-Net-based transformers (perhaps UNETR [2] or Swin-UNETR [3])? Additionally, why were only augmentations such as random flip and rotation applied? Why not blurring, gamma correction, or elastic deformation?
- In Table 1, it is evident that the significance of the improvement (two-sided paired t-test) measured by the Dice score decreases as Set-HQ increases, whereas it persists for the HD-95 metric. It would be interesting to see the statistical significance of the Dice score and HD-95 metrics when Set-HQ is further increased to 8 and 10, respectively.
- There is a typographical error on page 8, in the Implementation and Evaluation Metrics subsection: 'decayed' is repeated.
References:
[1] Isensee, F., Jaeger, P.F., Kohl, S.A.A., et al.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18, 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z
[2] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: UNETR: Transformers for 3D medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 574–584 (2022)
[3] Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. In: Crimi, A., Bakas, S. (eds) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2021. LNCS, vol. 12962. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-08999-2_22
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The strengths of this paper lie in its potential application and its superior performance relative to multiple methods. The positives include the high quality of the paper's organisation and the strong evaluation. Therefore, I recommend acceptance.
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The paper tackles a challenging and practical problem: segmentation in the presence of noisy labels. All reviewers acknowledge that this is an important problem. R1 and R3 are more keen on accepting, while R2 leans towards a weak reject. However, all three reviewers point out that the dataset is limited (100 cases); the authors seem to think otherwise. Overall, the consensus is that while the approach is interesting and practical, the limited sample size reduces confidence in the overall findings. R2 raises some methodological concerns, but these are more suggestions than weaknesses. The authors will need to focus on justifying why they believe their sample size is sufficient to validate the approach.
Author Feedback
We are glad that the reviewers find our work "tackles a challenging and practical problem" (all) and note its "high-quality writing and organization" (R1, R3) and "good results and strong evaluation" (R1, R3). Thanks for the constructive comments. Our responses to the major concerns are as follows.
Q1 (AC): Reviewers point out that the dataset is limited (100 cases). The authors need to focus on justifying why they believe their sample size is sufficient to validate the approach. A1: (1) This left atrium MRI dataset is a well-established benchmark in semi-supervised studies [1,2], i.e., making use of limited labeled data (N) and abundant unlabeled data (M, M≫N). As a closely related task, we strictly follow their data split (https://github.com/yulequan/UA-MT/tree/master/data) and base implementation (e.g., the same backbone and training protocols) for fair comparison. Thus, we believe this 100-scan benchmark provides good evidence for validation. (2) We also agree that more datasets can provide more convincing validation. We further validate our method on the public brain MRI tumor segmentation dataset (BraTS19, train/val/test: 250/25/60). Given only 10 HQ labeled scans and 240 LQ labeled scans, the Dice scores of H-Sup/HL-Sup/UAMT/Decoupled/MTCL are 0.723/0.762/0.769/0.784/0.791. Our PLIL achieves 0.815, only 0.044 behind the upper bound (0.859). The complete results will be included in our extension due to MICCAI's limited space.
Q2 (R1&R2): Why not extract feature maps from the encoder or select other scale features? A2: The decoder features, already enriched through skip connections with encoder features, capture contextual information and are close to the final segmentation. This facilitates generating prototypes that capture discriminative characteristics and task-relevant semantics. Selecting the last three layers leverages the network's hierarchy to extract more semantically rich and segmentation-relevant features; selecting only three keeps the computational cost manageable.
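As a hedged illustration of the multi-scale voting discussed in A2, the sketch below assumes prototypes are recomputed per scale from the current (possibly noisy) labels, and that decoder features are upsampled to the label resolution so each scale casts one vote per voxel. All names are illustrative and not taken from the PLIL repository.

```python
import torch
import torch.nn.functional as F

def multiscale_noisy_vote(scale_features, labels, num_classes=2):
    """Majority vote of per-scale nearest-prototype decisions.

    scale_features: list of (C_s, D_s, H_s, W_s) decoder feature maps,
                    e.g. the last three decoder scales
    labels:         (D, H, W) integer low-quality label map at full resolution
                    (assumes every class appears at least once)
    """
    flat_labels = labels.reshape(-1)
    votes = []
    for feats in scale_features:
        # Upsample each scale to the label resolution so that every scale
        # casts exactly one vote per voxel.
        up = F.interpolate(feats.unsqueeze(0), size=labels.shape,
                           mode='trilinear', align_corners=False).squeeze(0)
        f = up.reshape(up.shape[0], -1).t()                    # (N, C_s)
        # Per-scale class prototypes: mean feature of voxels per labeled class.
        protos = torch.stack([f[flat_labels == k].mean(dim=0)
                              for k in range(num_classes)])    # (K, C_s)
        assign = torch.cdist(f, protos).argmin(dim=1)          # nearest prototype
        votes.append((assign != flat_labels).float())          # 1 = disagreement
    # A voxel is flagged as suspected-noisy only if most scales agree.
    return (torch.stack(votes).mean(dim=0) > 0.5).reshape(labels.shape)
```

Upsampling before voting is just one simple way to align resolutions; the authors' released code may handle the scales differently.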
Q3 (R1): What if the high-quality (HQ) labeled images are fed to the teacher model? A3: The student model is trained via back-propagation, while the teacher model is not trainable and is gradually updated as an exponential moving average of the student's weights. We expect the HQ labeled images to directly optimize the student model. The teacher model is self-ensembling by nature, providing a more stable feature space that helps isolate noisy voxels.
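The exponential moving average (EMA) update mentioned in A3 is the standard mean-teacher mechanism; a minimal sketch follows (the decay value 0.99 is a common default, not necessarily the paper's setting):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """Mean-teacher style update: the teacher is never trained directly,
    only tracked as an exponential moving average of the student's weights,
    applied after each optimizer step."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(alpha).add_(s.data, alpha=1.0 - alpha)
```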
Q4 (R2): Suggest adding a comparison using different values of η or adopting percentile values as the uncertainty threshold. A4: We use the normalized predictive entropy as the uncertainty, so its range is [0, 1] and the current η=0.25 is already a percentile value. Experiments show similar results with η ∈ {0.2, 0.25, 0.3}, with η=0.25 obtaining slightly better results.
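For reference, normalized predictive entropy as described in A4 can be computed as in the sketch below (illustrative names only; dividing by log K is what bounds the uncertainty to [0, 1]):

```python
import math
import torch

def low_uncertainty_mask(probs, eta=0.25, eps=1e-8):
    """probs: (K, D, H, W) softmax probabilities over K classes.

    Dividing the entropy by log(K) bounds the uncertainty to [0, 1],
    which is why a fixed threshold such as eta = 0.25 already acts as
    a fraction of the full uncertainty range.
    """
    K = probs.shape[0]
    entropy = -(probs * (probs + eps).log()).sum(dim=0)  # (D, H, W)
    uncertainty = entropy / math.log(K)                  # normalized to [0, 1]
    return uncertainty <= eta                            # reliable voxels
```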
Q5 (R2): Why not combine the label-noise identification results with the LQ labels to obtain robust pseudo masks and generate reliable prototypes for training in the next epoch? A5: Directly combining the results with LQ labels is not ideal due to the imperfect noise identification shown in Fig. 2, especially at the early training stage or when the LQ labels contain severe noise. Instead, we use model uncertainty to help generate pseudo labels for reliable prototype generation. The proportion of low-uncertainty voxels may be small initially but is dynamically adjusted as training progresses, offering greater adaptability.
Q6 (R3): Removing the m-masked supervised loss in the ablation study led to serious confirmation bias. A6: Removing the m-masked supervised loss means we discard the guidance from the identified clean voxels. The resulting degradation does not indicate confirmation bias; rather, it reveals that our strategy effectively identifies the clean voxels and that their additional clean supervision is important.
References: [1] Yu et al.: Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. MICCAI 2019. [2] Wu et al.: Semi-supervised left atrium segmentation with mutual consistency training. MICCAI 2021.
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Only one reviewer responded after the rebuttal, but that reviewer was satisfied with the authors' responses and voted to accept. Reading the authors' arguments, it is clear that they have addressed the concerns raised.
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
Overall, the authors have done a good job of addressing the concerns raised by the reviewers. Notably, the inclusion of additional experimental results on the BraTS19 dataset further strengthens the supporting evidence for the proposed method. Based on the authors' thorough responses and their demonstrated commitment to addressing all reviewers' questions and concerns in the rebuttal (and in the revised manuscript), I recommend accepting this paper.
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
This work proposes a novel semi-supervised method for the left atrium segmentation task using a teacher-student network with inputs of different quality.
The reviewers raised major issues regarding the limited data and some methodological questions. The rebuttal clarified the methodological questions, and the authors added another public dataset to demonstrate their method on a different application. This is a borderline paper with interesting methods, but the evaluation is limited due to the small database. The final version should be updated to include the latest results on the BraTS19 database.