
Authors

Fan Bai, Xiaohan Xing, Yutian Shen, Han Ma, Max Q.-H. Meng

Abstract

Weakly supervised methods, such as those based on class activation maps (CAM), have been applied to achieve bleeding segmentation with low annotation effort in Wireless Capsule Endoscopy (WCE) images. However, CAM labels tend to be extremely noisy, and there is an irreparable gap between CAM labels and ground truths for medical images. This paper proposes a new Discrepancy-basEd Active Learning (DEAL) approach to bridge the gap between CAMs and ground truths with a few annotations. Specifically, to reduce labeling effort, we design a novel discrepancy decoder model and a CAMPUS (CAM, Pseudo-label and groUnd-truth Selection) criterion to replace the noisy CAMs with accurate model predictions and a few human labels. The discrepancy decoder model is trained with a unique scheme to generate standard, coarse, and fine predictions, and the CAMPUS criterion is proposed to predict the gaps between CAMs and ground truths based on model divergence and CAM divergence. We evaluate our method on the WCE dataset, and the results show that our method outperforms state-of-the-art active learning methods and reaches performance comparable to models trained on fully annotated data with only 10% of the training data labeled. Codes will be available soon.
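
As an illustration of the multiplicative ranking behind the CAMPUS criterion described above, a minimal sketch could look as follows; the function name and toy numbers are hypothetical and not the authors' implementation.

    import numpy as np

    def campus_rank(model_div, cam_div):
        """Hypothetical sketch: rank unlabeled images by the product of
        per-image model divergence and CAM divergence. Because the two
        criteria are multiplied, rescaling either one by a positive
        "weight" leaves the ranking unchanged."""
        score = np.asarray(model_div) * np.asarray(cam_div)
        return np.argsort(-score)  # indices from highest to lowest score

    # Toy usage with five unlabeled images.
    model_div = np.array([0.05, 0.40, 0.10, 0.70, 0.20])
    cam_div = np.array([0.60, 0.10, 0.80, 0.65, 0.05])
    print(campus_rank(model_div, cam_div))        # [3 2 1 0 4]
    print(campus_rank(2.0 * model_div, cam_div))  # same ranking; weights cancel

In the paper, such a ranking feeds the CAMPUS criterion, which decides whether an image keeps its CAM label, receives a model pseudo-label, or is sent for human annotation.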

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_3

SharedIt: https://rdcu.be/cVRYG

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

This paper presents an active learning method to reduce annotation cost for medical image segmentation. The method uses novel criteria for image selection. On a public dataset, it outperforms many baseline approaches from the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. It is important to explore means to reduce the cost of medical image annotation. The presented method, DEAL, is a promising approach: with only a 10% label budget, it achieves performance comparable to the fully supervised method.

2. The presented method is technically sound. Its key components are clearly explained, with precise formulations.

3. This paper presents good experiments with promising performance and an informative ablation study. The proposed method and eight baseline approaches are compared on a public dataset. Promising quantitative results are reported in comparison with the baseline approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The presented method design is mostly empirical and lacks theoretical analysis. For example, no discussion is provided to illustrate the convergence of the proposed training procedure.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This paper provides source code and uses a public dataset. It should have good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. This paper could be improved with an analysis of the convergence of the training procedure.

2. It could be further improved if the authors can show that the current formulations of model divergence and CAM divergence are theoretically optimal.

3. Fig. 1 could be made clearer, for example, by rearranging the sub-figures into a clearer logical flow.

4. Minor issues: a) the abbreviation GI should be defined before use; b) typo “Tabel” in Section 3.2.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper studies an important problem, presents a technically sound approach, and reports promising results. These are the major factors that led to the positive rating.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

This paper proposes an active learning framework for weakly supervised bleeding segmentation. It introduces a new scheme to effectively select pseudo labels or ground truth for training images based on the reliability of the generated labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper proposes a new active learning framework that selects pseudo labels according to the proposed CAMPUS criterion, considering model divergence and CAM divergence.
    2. The proposed way to measure the divergence based on different threshold values is simple and efficient. The quality of the generated annotations is a critical problem in weakly supervised learning. The proposed method considers both the model divergence and CAM divergence.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The proposed method is based on the assumption that large probability values represent high reliability. The assumption is natural in general, but how valid is it? It would be better to add more discussion or experiments on this point.
    2. The method replaces CAM labels with generated labels when samples have small model divergence and large CAM divergence, and the ranking is given by the product of model divergence and CAM divergence. Do model divergence and CAM divergence carry equal weight in the final result? Is there a better way to measure their influence?
    3. It would be nice to evaluate on more than one dataset in the experimental part.
    4. It would be better to provide more details about the selected baseline methods. Why were these methods chosen?
    5. What is the SOTA performance on the CAD-CAP WCE dataset? Only one fully supervised result is reported; no SOTA result is mentioned or compared.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Meets the requirement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See the weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper brings an interesting method for active learning based weakly supervised segmentation.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

This paper proposes an active learning pipeline that incorporates a CAM-based weakly supervised method and a pseudo-label-based semi-supervised method. Experiments demonstrate that the proposed method has considerable advantages over prior works in the 10% data regime.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. A novel active learning pipeline (DEAL) that borrows ideas from CAMs and pseudo labels, together with a carefully designed label selection criterion that fits the pipeline;
    2. CAMs and a decoder with multiple heads/predictions to tackle the noisy predictions caused by weak supervision;
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The design of multiple propensities (coarse, standard, fine) lacks details and justification;
    2. The training procedure, especially the design of the loss function, lacks intuition;
    3. The improvement is mostly in the 10% data regime, while in the 20% and 30% data regimes the improvement is marginal; also, the evaluation is done on only one dataset;
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors have mentioned that the code will be made public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
1. As discussed in the weaknesses part, the design of multiple propensities needs further justification, i.e., using multiple sub-decoders in active learning, which could be viewed as a model ensemble, has been shown to improve performance in prior works (e.g., [1]). What would the performance be if three identical propensities were simply used? Please also cite the relevant papers. [1] Beluch, W.H., Genewein, T., Nürnberger, A. and Köhler, J.M., 2018. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9368-9377).
    2. The paper only mentions that the three-propensity CAMs are generated by multi-thresholding; how are the thresholds determined?
    3. Equation 4 maximizes the L1 distance between D_c and D_f; what is the intuition behind this? Should D_f be a subset of D_c? Maximizing the L1 distance would hurt the model performance on the overlapping region between them;
    4. For Equation 5, the paper mentions that the goal is “making the boundary of the discrepancy decoders always surround …standard decoder”; how is this achieved?
    5. Training steps (2) and (3) seem contradictory: in (2), D_c and D_f are supervised to be different, while in (3) they are supervised to be the same. What is the intuition behind this, and why would it not cause model failure?
    6. The writing needs to be improved and made more precise, e.g., the paper claims that the improvement is significant and that other methods are “far inferior to ours”. However, comparing the performance in Table 1, the improvement is mostly in the 10% data regime, while for the others the improvements are marginal.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper explores and exploits ideas from semi-/weakly-supervised learning and proposes a novel active learning pipeline. Experiments demonstrate the superiority of the method over prior works. On the other hand, some design choices need clearer insight and in-depth justification.

  • Number of papers in your stack

    8

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The paper presents a label-efficient algorithm leveraging CAMs and pseudo labels. The authors propose a nice framework for weakly supervised learning, demonstrate the method on a public dataset, and show promising results.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

We thank the reviewers (R1, R2, and R3) and the Area Chair for their acknowledgment of our methodological contribution and for their constructive comments. We would like to clarify the following issues:

To R1:
Q1: The method design lacks theoretical analysis.
A1: Due to space constraints, we simplified some theoretical descriptions and omitted proofs, which we will consider adding to the final version.
Q2: It could be further improved if the authors showed that the formulations are theoretically optimal.
A2: In fact, our contribution is to propose the concepts of model divergence and CAM divergence to evaluate uncertainty. Different formulations of these criteria can be tried, and the one we present performs best experimentally.

To R2:
Q1: Large probability values represent high reliability; how valid is this assumption?
A1: We argue that reliability depends on the discrepancy in the probability distributions (model divergence and CAM divergence) rather than on the probability values themselves.
Q2: Weight influence.
A2: We did not add weights because the criteria are directly multiplied, and weights do not affect the ranking. We chose a multiplicative rather than an additive combination precisely to avoid the influence of weights.
Q3: Evaluate on more than one dataset.
A3: This is a good suggestion! We will consider it in future work.
Q4: Why choose these baseline methods?
A4: We clarify as follows. Random is a generic baseline. Dice is a naive way to estimate CAM uncertainty. VAAL (diversity based), CoreSet (representation based), CoreGCN (representation based), UncertaintyGCN (uncertainty based), and GGS (gradient based) are strong methods covering the main categories of active learning. We compared our method with these various SOTA methods to verify its effectiveness.
Q5: Only one fully supervised result is reported; no SOTA result is mentioned or compared.
A5: We compared against the fully supervised method under the same setting, e.g., a ResNet50 backbone. The other SOTA methods use different settings.

To R3:
Q1: The design of multiple propensities lacks details and justification.
A1: Thanks for this helpful comment. Because of space constraints, it is only briefly explained in Section 2.1. In detail, the standard CAM is determined by the optimal threshold 0.8; the coarse threshold is 0.75; and the fine threshold is 0.85.
Q2: Some detailed and constructive comments about the training procedure.
A2: Thanks for this meaningful question. First, we clarify that only the predictions of the standard decoder are used for testing. The two discrepancy decoders only help select data and do not represent the model performance; therefore, we are not concerned about their performance degradation and expect them to produce different predictions. To train the discrepancy decoders, we use a maximum discrepancy loss shown to be effective in prior work, e.g., Saito et al., ‘Maximum Classifier Discrepancy for Unsupervised Domain Adaptation.’ To avoid hurting performance in the overlapping region, we make the discrepancy decoders approach the standard decoder, ensuring that the discrepancy boundaries always surround the standard boundary. Under alternating training, the seemingly conflicting objectives make the model reach a ‘force balance’ in the end, as verified by our experiments.
Q3: The improvement is marginal in the 20% and 30% data regimes.
A3: We clarify that this is reasonable in active learning: as the number of annotations increases, the impact of sample selection decays.
Q4: What is the performance with three identical propensities? Please also cite the relevant papers.
A4: Thanks for the insightful question. We will add references and discussion in the final version. However, there are some differences: our multiple propensities can reflect the sensitivity of the CAM. In earlier experiments, three identical propensities yielded worse performance (0.7868 < 0.7947) with 10% annotations.
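
To make the rebuttal’s description of the multi-threshold propensities and the discrepancy training more concrete, the following minimal sketch uses the quoted thresholds (coarse 0.75, standard 0.80, fine 0.85); the function names and the exact form of the loss terms are illustrative assumptions rather than the authors’ implementation.

    import torch

    # Thresholds quoted in the rebuttal: coarse 0.75, standard 0.80, fine 0.85.
    THRESHOLDS = {"coarse": 0.75, "standard": 0.80, "fine": 0.85}

    def propensity_masks(cam: torch.Tensor) -> dict:
        """Binarize a normalized CAM heatmap (values in [0, 1]) at the three
        thresholds to obtain coarse, standard, and fine pseudo-masks."""
        return {name: (cam >= t).float() for name, t in THRESHOLDS.items()}

    def discrepancy_terms(p_std, p_coarse, p_fine, lam=1.0):
        """Assumed form of the discrepancy training signal: push the coarse
        and fine decoder outputs apart (maximize their L1 distance) while
        anchoring both to the (detached) standard decoder output so their
        boundaries stay around the standard prediction."""
        push_apart = -torch.mean(torch.abs(p_coarse - p_fine))
        anchor = torch.mean(torch.abs(p_coarse - p_std.detach())) + \
                 torch.mean(torch.abs(p_fine - p_std.detach()))
        return push_apart + lam * anchor

In this toy form, minimizing the combined term spreads the two auxiliary decoders apart while the anchor keeps both near the standard decoder, which is one way to read the ‘force balance’ intuition in the rebuttal.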


