
Authors

Jiancheng Yang, Rui Shi, Udaranga Wickramasinghe, Qikui Zhu, Bingbing Ni, Pascal Fua

Abstract

Human annotations are imperfect, especially when produced by junior practitioners. Multi-expert consensus is usually regarded as the gold standard, but this annotation protocol is too expensive to implement in many real-world projects. In this study, we propose a method to refine human annotations, named Neural Annotation Refinement (NeAR). It is based on a learnable implicit function, which decodes a latent vector into the represented shape. By integrating the appearance as an input of the implicit function, the appearance-aware NeAR fixes annotation artefacts. Our method is demonstrated on the application of adrenal gland analysis. We first show that NeAR can repair distorted gold standards on a public adrenal gland segmentation dataset. In addition, we develop a new Adrenal gLand ANalysis (ALAN) dataset with the proposed NeAR, where each case consists of a 3D shape of an adrenal gland and its diagnosis label (normal vs. abnormal) assigned by experts. We show that models trained on the shapes repaired by NeAR diagnose adrenal glands better than models trained on the original ones. The ALAN dataset, with 1,584 shapes for adrenal gland diagnosis, will be open-sourced and serves as a new benchmark for medical shape analysis. Code and dataset are available at https://github.com/M3DV/NeAR.
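
To make the refinement idea concrete, a minimal illustrative sketch of an appearance-aware implicit function in the spirit of NeAR is given below in PyTorch: an MLP maps a query coordinate, a local appearance value (e.g., the HU intensity sampled at that coordinate), and a per-shape latent code to an occupancy probability. All names and dimensions here are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

    import torch
    import torch.nn as nn

    class AppearanceAwareImplicitFn(nn.Module):
        """Sketch of an appearance-aware implicit surface decoder (hypothetical)."""
        def __init__(self, latent_dim=128, hidden_dim=256):
            super().__init__()
            # Input: 3D coordinate (3) + appearance scalar (1) + per-shape latent code.
            self.mlp = nn.Sequential(
                nn.Linear(3 + 1 + latent_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, coords, appearance, latent):
            # coords: (N, 3), appearance: (N, 1), latent: (latent_dim,)
            z = latent.unsqueeze(0).expand(coords.shape[0], -1)
            logits = self.mlp(torch.cat([coords, appearance, z], dim=-1))
            return torch.sigmoid(logits)  # per-point occupancy probability in [0, 1]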

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_48

SharedIt: https://rdcu.be/cVRwz

Link to the code repository

https://github.com/M3DV/NeAR

Link to the dataset(s)

https://github.com/M3DV/NeAR


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a method to refine mask annotations to get better segmentation/classification results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The mask annotation refinement problem that this paper tries to solve is interesting;
    2. With the refined annotations produced by the proposed method, models can diagnose adrenal glands better than models trained on the original annotations.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method is not very clear. What does the symbol a represent?
    2. The novelty of the paper is low. It only changes the deep implicit surface by adding an a, and it aggregates features from multiple scales. Why is a needed? And multi-scale feature fusion is a widely used operation in deep learning.
    3. The experiments are not convincing. First, the method is not compared with other label refinement methods. Second, for the segmentation comparison experiments (Section 4.1), inference is evaluated against the golden annotation while training uses the distorted annotation. Thus it is not a fair comparison with the baseline methods, since they do not have any label correction process.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors do not report the central tendency & variation as claimed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See the list of weaknesses above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the paper is low, the description of the method is not clear, and the comparison experiments are not convincing.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    6

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision
    1. Even though the proposed method aims to solve an interesting problem, the technical contribution is limited and the effectiveness of the method is not fully proven. The authors pointed out that "However, standard implicit surface methods are not aware of the appearance, thus the reconstructed surfaces could be misaligned with the actual boundaries. It motivates us to propose the appearance-aware implicit surface model for annotation refinement." How effectively can a single HU value solve this challenge? A deeper study is needed.

    2. The authors still did not address the question of why they did not compare with other label refinement/smoothing methods. The proposed method is only compared with Seg-UNet and Seg-FCN, which are not designed to correct distorted labels. There are many mask refinement/smoothing methods, such as [1][2]. [1] Blaha, Maros, et al. "Semantically informed multiview surface refinement." Proceedings of the IEEE International Conference on Computer Vision. 2017. [2] Morerio, Pietro, et al. "Generative pseudo-label refinement for unsupervised domain adaptation." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020.

    3. A minor issue: the appearance a is the main contribution of the paper. However, it is not defined clearly. Does a mean the HU value? It would be better to give a formal definition of a.



Review #2

  • Please describe the contribution of the paper

    3D segmentation masks of medical images are often noisy. The authors proposed an appearance-aware implicit function-based method to refine the human annotations of a 3D adrenal gland dataset. The shape modeling method with a modified implicit function is technically sound. The new dataset reduces label noise for downstream image analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths are as follows:

    1. Incorporating HU values of CT images as part of the input of the implicit function-based shape modeling method.

    2. The refinement of segmentation masks reduces noise and improves diagnosis as a downstream classification task.

    3. A new dataset with smoother shapes and less noise for image segmentation and classification.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are some moderate weaknesses:

    1. Results.

    x. In Table 1, the authors compare their method with Seg-FCN and Seg-UNet. The two models are trained on distorted annotations. However, it would be great if the authors could show the upper bound, e.g., models trained on good annotations. As a repair baseline, one could consider an ensemble of multiple models to reduce segmentation noise and uncertainty. Other heuristic methods to remove noise, such as hole filling / connected component analysis, might be other baseline methods.

    x. In Table 2, the method without using the appearance as the input works worse. It would be great if the authors could discuss the potential reasons behind this, e.g., does it mean the basic refinement method does not improve the annotation?

    2. Clarifications.

    x. In Table 2, which view is used in the 2D method?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All code is available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In Table 1, it would be great if the authors could show the upper bound, e.g., models trained on good annotations. As a repair baseline, one could consider an ensemble of multiple models to reduce segmentation noise and uncertainty. Other heuristic methods to remove noise, such as hole filling / connected component analysis, might be other baseline methods.

    2. Discussion on the limitation might improve the quality of this work. For example,

    x. The proposed approach incorporates CT intensity values to improve the reconstruction quality. Absolute intensity values contain tissue information; however, this might not work well for MR images.

    x. This approach works for regular shapes (e.g., organs) and might not work for abnormal tissue such as lesions, due to shape heterogeneity or small structure size.

    3. Some typos / grammar errors:

    x. In Introduction, "Unfortunately, such datasets are difficult to obtain in part because human annotations are known to be imperfect". Isn't it because expert annotations are expensive?

    x. In Introduction, high-frequency artefacts include false positives/negatives.

    x. In Introduction, MLP appears without a full name.

    x. In Sec. 2.2, ‘by changing the input of…’, ‘changing’ should be ‘including’?

    x. dice –> Dice

    x. ‘lower bound’ –> ‘minimum’; ‘upper bound’ –> ‘maximum’

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Improved segmentations based on human annotations. Good methodology contribution for shape modeling.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    The rebuttal fully addressed my concerns and provided additional results. I would recommend 'strong accept' considering: a) the methodological novelty of the neural implicit function, and b) the significant impact on reducing annotation errors and improving the downstream task.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel appearance-aware implicit surface model for ground-truth repair.

    This paper also contributes a new data set for adrenal gland analysis.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Contribution of a new dataset.

    2. The neural refinement idea is interesting. The method's novelty itself might be incremental, but I don't see that as a major issue. This paper addresses a common issue: human annotations are often not smooth and full of mistakes because they are made on slices instead of the whole volume, so surface-level correction is expected.

    3. This paper might also help active learning researchers.

    4. Extensive experiments.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    My biggest concern are on the motivation and comparisons of the proposed method:

    1. For training on the distorted dataset, maybe 20% distortion won't affect the downstream task performance, and human experts may only introduce less than 20% distortion. It would be nice to find the distortion threshold on the gold-standard dataset to see how much distortion actually brings down the performance of the downstream task, for example, when 30% of slices are distorted or the distortion is more severe. Such an analysis is expected.

    2. This method is data-centric, but what if the authors trained a bigger classification network with more diverse and complicated data augmentation focused on the appearance, and better regularisation? Wouldn't that be a simpler way to tackle this problem? The authors need to add a comparison with a larger model and more intense data augmentation. I noticed the authors are using ResNet-18; larger models can learn better generalisation, and the classification performance gap between the proposed method and the baseline is not as big as it seems. I strongly suggest the authors consider this issue, and I am willing to raise the score after the rebuttal if the authors address it.

    Minor weakness:

    1. Only evaluated on one dataset. It would be nice to see how the method applies to another dataset; that would make this paper really appealing.

    2. What happens when the proposed NeAR model is used on very sparse annotations? Do you think the models will have limits on discontinuous objects and extremely small objects?

    3. No standard deviations in Table 1.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code is attached in the supplementary materials, although I personally did not run it. The authors promise to release the dataset too.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be nice to see how this method applies to a second dataset, especially one with very sparse annotations such as lung vessels or very tiny structures.

    2. It would be nice to see a comparison between this method and GNN-based approaches.

    3. It would also be nice to see how much human labour has been saved; for example, the authors could ask human experts to refine the annotations and compare the time cost.

    4. In Table 1, what are the parameter counts of NeAR and Seg-UNet? It would be nice to see a true comparison of the performance.

    5. In Table 1, what are the standard deviations?

    6. It would be nice to assess how realistic the synthetic distortion is and how much it affects the downstream tasks. For example, if humans introduce 30% distortion of the "ground truth" and the classification network is robust when trained with 30% distorted data, then the refinement might not be needed and one should focus on improving the model instead. It would be nice to add an analysis of the effect of the distortion on the downstream task.

    7. Refinement of data vs. refinement of model. It is necessary to add comparisons in Table 2 with model-focused methods: for example, more appearance-focused augmentation plus a bigger model with better representation-learning ability might simply improve the classification result, and maybe refinement of the dataset is not necessary.

    8. Could the authors add connections between the proposed method and NeRF (neural radiance field)?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Geometric deep learning has become more and more important in medical image analysis, as surface representations are closer to real-world ones. They deserve more attention in the community.

    2. Enough contributions in both datasets and methods.

    3. A few weaknesses need to be addressed in the rebuttal and I am willing to raise my score if the rebuttal addresses my main concerns.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal and the other reviews, I maintain my score. I agree with R1 that the novelty is limited, but I find the proposed research problem very interesting; it could be of interest to a large part of the community. I will leave it to the ACs to decide.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All reviewers agree the idea of using deep implicit surfaces to refine annotations is interesting and is also an important problem in medical imaging. I agree with this assessment. However, all reviewers brought up concerns as to experiments, namely: experiments only on one dataset; no comparisons against other label refinement approaches; competitors only trained on distorted labels; no std dev or other numbers to indicate statistical significance.

    These are very important concerns, and I encourage the authors to address them in the rebuttal. Note, this is not a request for additional experiments, as that is not the purpose of the rebuttal. But authors should try to address these concerns in order to provide a more convincing presentation of their method and ideas.

    Please also note reviewers also brought up many valuable clarity and exposition issues that authors are also encouraged to address.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank the (meta-)reviewers (MR2, R1, R2, R3) and address their main concerns below.

  1. Datasets used in experiments (MR2&R3) Experiments were conducted on 2 different adrenal gland datasets: 1) A dataset with golden standard segmentation from AbdomenCT-1K, to quantitatively analyze the performance of NeAR on repairing distortion. 2) A new ALAN dataset with diagnosis labels to demonstrate downstream application, which will be made publicly available.

  2. Fair comparisons against other methods (MR2&R1) We compared our method against others: “Seg-FCN” and “Seg-UNet” take images as input and try to output clean segmentation, while trained with distorted / manual segmentation masks. We compared their repairing performance quantitatively in Sec 4.1 and the impact on a downstream application in Sec 4.2. It is a fair comparison between Seg-FCN, Seg-UNet, and NeAR, because all these methods are trained using distorted segmentation masks in Sec 4.1 or manual ones in Sec 4.2.

  3. Std dev (MR2&R3) There are already repeated experiments in Tab 2. We also repeat the Tab 1 experiments 5 times, as follows:
     79.56 ± 0.29 | 78.70 ± 0.45 | 78.79 ± 0.45 | 81.07 ± 0.22
     89.54 ± 0.33 | 87.71 ± 0.90 | 87.96 ± 0.51 | 91.22 ± 0.12
     The conclusion remains unchanged.

  4. "The proposed method is not very clear. What's the symbol a? The novelty of the paper is low. The results are unconvincing." (R1) The symbol a stands for "appearance", as defined in Sec 2.2. As to novelty, deep implicit surfaces are not used much in medical image analysis because of the problems discussed in the introduction. Our approach works around them, and using appearance is an important part of it. Finally, regarding the significance of our results, please see our answers #5 & #6 below.

  5. Effect of distortion + larger models on downstream tasks (R3) To understand the effect of distortion on downstream classification, we further created 20%- and 30%-distorted versions of the ALAN dataset from the human-annotated masks, and trained larger classification models (ACS ResNet50) with data augmentation (random noise + resizing). The 5-trial results are as follows (an illustrative sketch of such slice-wise distortion is given at the end of this response):
     Model | NeAR (S+A) | Human | Dist 20% | Dist 30%
     Res18 | 91.58 ± 1.13 | 90.10 ± 0.90 | 84.34 ± 1.40 | 81.39 ± 2.58
     Res50 | 90.84 ± 0.22 | 89.95 ± 0.96 | 84.35 ± 1.59 | 81.51 ± 2.58
     Res50+aug | 91.01 ± 0.47 | 90.35 ± 0.47 | 84.87 ± 1.19 | 84.84 ± 0.64
     Larger classification models and data augmentation improve performance on highly distorted masks. However, they have little impact when training on repaired or human-annotated masks. Meanwhile, repairing with NeAR improves downstream classification with several different backbones. Furthermore, visually appealing masks can be crucial in a shape modeling dataset.

  6. Human labor saved (R3) We asked an expert to manually refine 5 masks; it took ~4 min per mask. This represents ~105 h to refine all samples, which is much more than NeAR (~60 h on 2 GPUs). Moreover, machine computing is almost free, while human expert labor can be expensive.

  7. NeAR and NeRF (R3) NeAR is an implicit method for shape modeling, especially designed for label refinement; NeRF is an implicit method for neural rendering. We will discuss this in more detail.

  8. Heuristic repairing methods to remove noise (R2) We added heuristic smoothing as a baseline, including morphological closing and connected-component filtering, and tried several settings in the Tab 1 experiments. The highest Dice is 76.90, much lower than the neural methods (an illustrative sketch of such a heuristic baseline is given at the end of this response).

  9. Limitation (R2&R3) Testing only on adrenal glands is a clear limitation. In the future, we will test NeAR on sparse annotations and small objects.

Other issues
A. R2: To obtain the upper bound of label refinement methods, we trained models on gold-standard annotations. The Dice scores for Seg-FCN, Seg-UNet, NeAR(S) and NeAR(S+A) are 83.34, 80.74, 86.02, and 88.20, respectively. Thus, NeAR fits the gold-standard annotations better than the baselines.
B. R2: The axial view is used for the 2D method.
C. R3: The parameter count of Seg-FCN is ~36M, and that of Seg-UNet/NeAR is <1M.
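
Illustrative sketch of slice-wise distortion (referenced in answer #5 above). The exact distortion protocol of the paper is not reproduced here; the sketch below simply assumes that a given fraction of axial slices of a gold-standard mask is corrupted by random morphological erosion or dilation, and the function names are hypothetical.

    import numpy as np
    from scipy.ndimage import binary_dilation, binary_erosion

    def distort_mask(mask, slice_fraction=0.2, iterations=2, rng=None):
        """Corrupt a fraction of axial slices of a binary 3D mask (D, H, W)."""
        rng = rng if rng is not None else np.random.default_rng()
        out = mask.copy().astype(bool)
        n_slices = out.shape[0]
        chosen = rng.choice(n_slices, size=int(slice_fraction * n_slices), replace=False)
        for z in chosen:
            op = binary_erosion if rng.random() < 0.5 else binary_dilation
            out[z] = op(out[z], iterations=iterations)  # shrink or grow this slice
        return out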
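
Illustrative sketch of the heuristic smoothing baseline (referenced in answer #8 above): morphological closing followed by keeping the largest connected component. The structuring elements and settings actually tried are not specified in the rebuttal, so the parameters below are assumptions.

    import numpy as np
    from scipy.ndimage import binary_closing, label

    def smooth_mask(mask, closing_iterations=2):
        """Heuristic repair of a binary 3D mask: closing + largest connected component."""
        closed = binary_closing(mask.astype(bool), iterations=closing_iterations)
        labeled, n_components = label(closed)
        if n_components == 0:
            return closed
        sizes = np.bincount(labeled.ravel())[1:]  # component sizes, skipping background
        largest = int(np.argmax(sizes)) + 1
        return labeled == largest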




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I found the authors' rebuttal quite convincing. Although R1 gave a severely negative score, it was not backed up by enough concrete evidence. In particular, I disagree with R1 as to novelty, as I find the idea of using implicit functions for this application both interesting and new. The results in the paper do a good job of backing up the authors' conclusions. I encourage the authors to include the labour cost savings in their results/supplementary material, as I think this strengthens an already strong work. I view this work as a clear accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The post-rebuttal assessment has confirmed the pre-rebuttal trends, namely one reviewer is still in favor of rejection and the other two are in favor of acceptance.

    I am not fully convinced by the arguments of R1, in particular regarding the limited technical contribution and the unproven effectiveness. What is interesting in this paper is that the authors leverage recent advances in implicit surface modeling (neural implicit representations) to solve a practical problem (smoothing annotations). Even though more comparisons could be added (to other label refinement/smoothing methods, as suggested by R1), my opinion is that the paper remains interesting for the MICCAI community, and as such I recommend acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The key concerns raised by the reviewers are on novelty and experimental comparison. The rebuttal gave reasonable explanations for these concerns and also addressed the concerns about fair comparison well. Overall, this is an interesting application with good experimental support.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


