
Authors

Beilei Cui, Minqing Zhang, Mengya Xu, An Wang, Wu Yuan, Hongliang Ren

Abstract

Noisy labels are inevitable in medical image segmentation and cause severe performance degradation. Previous segmentation methods for noisy labels utilize only a single image, overlooking the potential of leveraging correlations between images. For video segmentation in particular, adjacent frames contain rich contextual information that helps in recognizing noisy labels. Based on these two insights, we propose a Multi-Scale Temporal Feature Affinity Learning (MS-TFAL) framework to resolve noisy-labeled medical video segmentation. First, we argue that the sequential prior of videos is an effective reference, i.e., pixel-level features from adjacent frames are close in distance for the same class and far in distance otherwise. Therefore, Temporal Feature Affinity Learning (TFAL) is devised to indicate possible noisy labels by evaluating the affinity between pixels in two adjacent frames. We also observe that the noise distribution varies considerably across the video, image, and pixel levels. We therefore introduce Multi-Scale Supervision (MSS) to supervise the network from these three perspectives by re-weighting and refining the samples. This design enables the network to concentrate on clean samples in a coarse-to-fine manner. Experiments with both synthetic and real-world label noise demonstrate that our method outperforms recent state-of-the-art robust segmentation approaches. Code is available at https://github.com/BeileiCui/MS-TFAL.
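
To make the TFAL idea concrete, below is a minimal sketch of computing per-pixel positive/negative temporal affinities between adjacent frames. It is written in PyTorch under assumed tensor shapes; the names A_p, A_n, t_p, and t_n follow the notation the reviewers refer to below, and the thresholding rule at the end is a hypothetical illustration rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def tfal_affinities(feat_t, feat_prev, labels_t, labels_prev):
    """Sketch of Temporal Feature Affinity Learning (TFAL).

    For each pixel of the current frame, compute its average cosine
    similarity to same-class (positive) and different-class (negative)
    pixels of the adjacent frame. Shapes are illustrative assumptions.

    feat_*: (C, H, W) pixel features; labels_*: (H, W) class maps.
    Returns A_p, A_n, each of shape (H*W,).
    """
    C, H, W = feat_t.shape
    a = F.normalize(feat_t.reshape(C, -1), dim=0).t()    # (N, C), unit-norm rows
    b = F.normalize(feat_prev.reshape(C, -1), dim=0)     # (C, N), unit-norm columns
    sim = a @ b                                          # (N, N) cosine similarities
    same = labels_t.reshape(-1, 1) == labels_prev.reshape(1, -1)
    A_p = (sim * same).sum(1) / same.sum(1).clamp(min=1)        # avg positive affinity
    A_n = (sim * ~same).sum(1) / (~same).sum(1).clamp(min=1)    # avg negative affinity
    return A_p, A_n

# A pixel whose positive affinity is low or whose negative affinity is
# high is a likely label error; t_p and t_n here are hypothetical
# thresholds, not values from the paper:
# noisy = (A_p < t_p) | (A_n > t_n)
```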

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_9

SharedIt: https://rdcu.be/dnwOJ

Link to the code repository

https://github.com/BeileiCui/MS-TFAL

Link to the dataset(s)

https://endovissub2018-roboticscenesegmentation.grand-challenge.org/


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors propose a Multi-Scale Temporal Feature Affinity Learning (MS-TFAL) framework to address the noisy label problem in medical video segmentation. Specifically, they design Temporal Feature Affinity Learning to capture average positive/negative affinities. The learned affinities are further utilized to supervise the training in a multi-scale scheme.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is technically sound, and the paper is overall well written and easy to follow. Ablation studies and comparisons with other state-of-the-art methods demonstrate the effectiveness of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In the definition of t_p and t_n, what do A_p and A_n mean?
    2. In the design of the final loss, the image-level and video-level weights are not applied to the label-corrected cross-entropy term; is there any explanation?
    3. Using contrastive learning in video segmentation to retain temporal consistency in the embedding space is common in other fields [1][2]; please elaborate on the differences from these works. [1] Chen, Yi-Wen, et al. “Video salient object detection via contrastive features and attention modules.” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2022. [2] Jiang, Zhengkai, et al. “STC: spatio-temporal contrastive learning for video instance segmentation.” Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. Cham: Springer Nature Switzerland, 2023.
    4. Since pixel-to-pixel cosine similarities must be calculated for every pair of adjacent frames, the time complexity is likely to be high; please provide more analysis of this aspect.
    5. The features used for the similarity calculation are not clearly described.
    6. In the last group of Table 1, the result of Seq 4 in the full setting (with pixel-wise supervision) is worse than in the “w/ V & I” setting; please explain.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is good given that the code is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses listed above.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the authors design a compact framework that utilizes temporal consistency in feature space for noisy-label medical video segmentation. However, the idea of using contrastive learning in video segmentation to retain temporal consistency in the embedding space is common in other fields, so the novelty is limited.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper focuses on noisy labels in medical image segmentation. Specifically, it learns affinity across frames and refines and re-weights samples via multi-scale supervision.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed multi-scale supervision is comprehensive, involving video, image, and pixel levels.
    2. It makes sense to learn affinity from other frames, which suits video segmentation well.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The baseline results seem weird: some anti-noise methods perform worse than the baseline under clean supervision. This should be explained.
    2. The framework conducts video segmentation in the manner of an image segmentation method. Some video techniques may handle the noise problem more easily, e.g., label smoothing across frames, or using a tracking method to propagate clean annotations. An image segmentation method for video segmentation is too naive.
    3. Why is it necessary to use adjacent frames? What about arbitrary frames from the dataset? This experiment is necessary; without it, the method is just a normal image segmentation method, since some WSSS methods also exploit information across images, with the model trained and run in the manner of normal image segmentation.
    4. There are some writing typos, e.g., in Fig. 1 the affinity is computed for the green region, but the corresponding point in feature space is shown in orange.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It’s not difficult to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Conducting video segmentation via video tracking or 3D convolution methods may be more reasonable than naive image segmentation. Also, check the baseline results and fix some typos.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Conducting video segmentation via an image segmentation method seems naive; video tracking or 3D convolution methods seem more reasonable, and some simple video operations such as label smoothing may also solve the noise problem well. The above settings are hard to change, while the baseline results and the necessity of adjacent frames noted in the weaknesses are easier to explain. If my concerns are addressed, I would be willing to change my decision.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I raise the score because some key experiments were added, specifically the ablation study on adjacent frame vs. arbitrary frame. But I would not regret it if the paper were rejected, because the noise rate is far more exaggerated than in real practice and invalidates the other baselines. Besides, simple denoising methods such as label smoothing across frames are not evaluated. Note that Table 1 in the rebuttal is confusing (it uses different data; what is the baseline?), which cannot well support the claim about video methods.



Review #4

  • Please describe the contribution of the paper

    This paper presents a framework to tackle the noisy label problem in video segmentation by utilizing the contextual information from adjacent frames for label correction. The pseudo labels are rectified at the pixel, image, and video levels. Experiments on three video datasets show that the proposed method achieves satisfying results and is robust to different noise levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • I like the idea of considering inter-frame relationships for pseudo-label correction in video segmentation. It is worth presenting this idea to the MICCAI community.
    • Pseudo labels are rectified at multiple levels: pixel, image, and video, which covers more comprehensive aspects of pseudo-label correction.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors claimed that it is beneficial to use inter-frame information to rectify the pseudo labels. Specifically, the method considers the cosine similarity of each pixel in the current frame to the pixels of the same/different class in the previous frame. However, in the experiments, it is not clear to me how using information from the other frame is more beneficial than simply using the current frame. To prove the advantage of using inter-frame information, the authors need to compare against the same affinity calculation method using the current frame only (similar to the pseudo-label correction method in [1]).

    • The comparison to two other backbones without noisy label learning seems weird to me. This paper is focused on the noisy label problem but in the experiments, the authors compare two other backbone networks without adding the proposed noisy label learning method. As a general noisy label rectifying approach, the authors should evaluate it on all evaluated backbones, instead of just DeepLabV3+. Or the authors could compare all noisy label correction methods using the same backbone.

    • In the submission checklist, the authors said ‘yes’ to ‘an analysis of statistical significance of reported differences in performance between methods’. I do not see any significance tests in the paper.

    [1] Zhang, Pan, et al. “Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors agreed to make the code repository public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • More comparisons would be great. Either compare to more noisy label correction methods using the same backbone, or evaluate the effectiveness of the proposed method using different backbones.

    • Need to justify the necessity of computing affinity using the adjacent frame. For example, compare to the affinity calculated using only the current frame.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Though some experiments could be conducted better, as pointed out in the weakness section, this paper is overall worth being visible to the MICCAI community.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    In the rebuttal, the authors did not adequately address my concerns. (1) Why is the adjacent frame better than the current frame? The authors mention that using the adjacent slice is intuitive since it contains time-domain information. In the follow-up experimental results, using the adjacent slice yields only a marginal improvement over using the same slice (60.50 vs. 59.31). Without significance tests and standard deviations, it is unknown whether the difference is significant. To me, it is not straightforward to see why the adjacent slice is better for affinity learning under noisy labels. (2) Experimental design. As stated in the review, the authors compare against two other SOTA methods for noisy-label learning and two other baseline methods without noisy-label learning. As the authors note in the rebuttal, comparing against two baselines without a noisy-label learning method is not very informative, since the finding that ‘training with noisy labels decreases performance’ is expected. As the main focus of this paper is a noisy-label learning method, it is natural to ask whether the method is also effective for other backbones. The authors’ response in the rebuttal does not answer this question.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes a video segmentation framework to handle noisy annotations. Compared with previous works, the affinity of labels and features is aligned, and multi-scale supervision is also used. Experiments on the EndoVis and RatColon datasets demonstrate the effectiveness of the method. As pointed out by the reviewers, the technical soundness of this work (why consecutive frames are used rather than a real video-based method) and the details of the pipeline require more clarification from the authors. Please carefully prepare the rebuttal to improve the manuscript.




Author Feedback

Thanks for the valuable comments and suggestions! We list the main concerns and our answers below:

  1. Why is the method based on an image backbone instead of a real video segmentation backbone? (R3, MR) A: 1) Previous research explores single-image strategies; we emphasize that our contribution is to demonstrate the advantage of handling label errors in an inter-image manner. Our method is a plug-in module that does not depend on the backbone type and can be applied to both image-based and video-based backbones. We have also conducted experiments with a video-based backbone, STswin [5], to show its generalization ability. Settings are the same as in Table 1 except for the backbone.
Data   Method        mIOU(%)  Dice(%)
a=0.3  STswin+Ours   53.06    63.34
a=0.5  STswin+Ours   49.25    59.77
a=0.8  STswin+Ours   37.73    47.96

Our method still greatly improves performance (compared to STswin in Table 1), proving that it also improves robustness to noisy labels on a video backbone. 2) Video tracking is more common in one-shot learning, which requires a ground-truth segmentation map for the first frame at test time; this does not match what is commonly needed in medical scenarios. Other video segmentation frameworks built on video techniques may also be sensitive to noisy labels: for example, as shown in Table 1, STswin [5] is a SOTA Swin transformer with a 3D attention mechanism, yet its performance also decreases significantly on datasets with noisy labels. 3) We intended to apply our method to a video segmentation backbone, but the training time was too long, and the other anti-noise methods we compare against can only be applied with an image-based backbone, so using the same backbone is fairer; we therefore chose the current backbone as a compromise.

  2. Why is it necessary to use the adjacent frame? (R3, R4, MR) A: Using adjacent frames is intuitive because they have the most temporal consistency and similarity, which yields a good correlation between class identity and feature similarity; only in this setting can noisy labels be well identified. If we choose an arbitrary frame or the same frame, there is no time-domain correlation with the current frame, and an arbitrary frame may contain entirely different classes, leaving no positive affinity clues. We ran two additional experiments (changing only the choice of frame; other settings are the same as Ours in the Table 1 ablation) to compare the choices of frame. Results are shown below.
Data   Method                 mIOU(%)  Dice(%)
a=0.5  Adjacent frame (Ours)  50.34    60.50
a=0.5  Any frame              48.69    58.89
a=0.5  Same frame             48.99    59.31

All three variants outperform the backbone-only result in Table 1, but using the adjacent frame performs best among the three choices. Due to limited space, we did not include this comparison of the choice of frame in the paper.

  3. Why do some anti-noise methods perform worse than the baseline under clean supervision? (R3) A: This phenomenon also appears in previous papers such as [4] and [15]. We believe it is because: 1) some anti-noise methods generalize poorly, making them suitable for a certain type of dataset but weak on others; 2) some methods assume a certain noise rate and therefore overcorrect on datasets with a low noise rate.

  4. Time-complexity question. (R2) A: Training DeepLabV3+ alone takes about 5 hours, while training with our method takes about 9 hours. The extra time mainly comes from the matrix multiplications used to calculate the affinity maps; a cost sketch is given below.
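
As a rough illustration of where the extra time goes, here is a minimal sketch (in PyTorch, with shapes that are illustrative assumptions rather than the paper's configuration) of the pixel-to-pixel affinity computation between two adjacent frames; the cost is dominated by one large matrix multiplication per frame pair, which is why such affinities would typically be computed on downsampled feature maps in practice.

```python
import torch
import torch.nn.functional as F

def affinity_map(feat_t, feat_prev):
    """Cosine-similarity affinity between two (C, H, W) feature maps.

    The single matmul below costs O((H*W)^2 * C) time and O((H*W)^2)
    memory, which accounts for most of the added training time.
    """
    C, H, W = feat_t.shape
    a = F.normalize(feat_t.reshape(C, H * W), dim=0)     # unit-norm feature columns
    b = F.normalize(feat_prev.reshape(C, H * W), dim=0)
    return a.t() @ b                                     # (H*W, H*W)

# Even at a modest 64x64 feature resolution, the affinity matrix has
# 4096 x 4096 ~= 16.8M entries per adjacent-frame pair.
aff = affinity_map(torch.randn(256, 64, 64), torch.randn(256, 64, 64))
print(aff.shape)  # torch.Size([4096, 4096])
```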

  5. Why compare with two other backbones? (R4) A: We followed the same setting as [4]. This shows that even SOTA image and video segmentation frameworks are very sensitive to noisy labels.

  6. Difference from contrastive learning (CL). (R2) A: CL is unsupervised representation learning that pulls features of the same class closer together and pushes different classes farther apart. Our method is built on the intuition that adjacent frames already exhibit this characteristic, which can thus indicate noisy labels. Combining CL with our method is a promising future direction we plan to pursue.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work proposes a video segmentation framework to handle noisy annotations. Multi-scale supervision is used in this paper. Experiments on the EndoVis and RatColon datasets demonstrate the effectiveness of the method. The rebuttal has addressed the issues raised by the reviewers; therefore, my final rating is accept.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors responded adequately to the reviewers’ comments. R3 raised the rating for this paper and now all the reviewers recommend acceptance of the paper. R4 still has concerns about the response of the authors regarding the use of adjacent frames and the experimental design.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Strengths: the affinity of labels and features is better aligned; multi-resolution enhancement; good empirical results. Weaknesses: only shown to work with frame-based methods; novelty somewhat limited. How the rebuttal informed the decision: the rebuttal explains that the proposed plug-in method could also be used with prior video-based approaches, which mitigates the concern about only working with frame-based methods.


