
Authors

Tae Soo Kim, Geonwoon Jang, Sanghyup Lee, Thijs Kooi

Abstract

As deep networks require large amounts of accurately labeled training data, a strategy for collecting sufficiently large and accurate annotations is as important as innovations in recognition methods. This is especially true for building Computer Aided Detection (CAD) systems for chest X-rays, where the domain expertise of radiologists is required to annotate the presence and location of abnormalities on X-ray images. However, there is little concrete evidence to guide how many resources to allocate to data annotation so that the resulting CAD system reaches the desired performance. Without this knowledge, practitioners often fall back to the strategy of collecting as much detail as possible on as much data as possible, which is cost-inefficient. In this work, we investigate how the cost of data annotation ultimately impacts CAD model performance on classification and segmentation of chest abnormalities in frontal-view X-ray images. We define the cost of annotation with respect to three dimensions: quantity, quality, and granularity of labels. Throughout this study, we isolate the impact of each dimension on the resulting CAD model performance in detecting 10 chest abnormalities in X-rays. On a large-scale training set of over 120K X-ray images with gold-standard annotations, we find that cost-efficient annotations provide great value when collected in large amounts and lead to competitive performance compared to models trained with only gold-standard annotations. We also find that combining large amounts of cost-efficient annotations with only small amounts of expensive labels leads to competitive CAD models at a much lower cost.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_25

SharedIt: https://rdcu.be/cVRtb

Link to the code repository

https://github.com/tk-lunit/miccai2022-annotation-cost

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    A large-scale analysis with 120K chest X-rays is conducted to analyze how annotation cost impacts CAD systems for classification and segmentation. Useful conclusions include: 1. bounding box annotations are as useful as accurate contours when provided as additional supervision to the classification model; 2. relatively small improvements to the label-extracting algorithms lead to gains in classification performance; 3. strong segmentation performance can be achieved by mixing image-level labels with only small amounts of pixel-level contour labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The findings are useful for the community to improve accuracy with minimum annotation cost.
    2. The study is conducted on a large dataset with manual labels and comprehensive results are reported. The quantity, quality, and granularity of annotations are analyzed for both classification and segmentation.
    3. I enjoyed reading the paper. It is clearly written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The way bounding boxes are leveraged in this paper can be improved. The boxes are directly converted to noisy pixel annotations to train the segmentation head, which is not optimal. There are many weakly-supervised segmentation algorithms that can better utilize the boxes. Thus, the comparison of image-level labels and box labels in the segmentation task may be debatable. Cf. Tajbakhsh, N., Jeyaseelan, L., Li, Q., Chiang, J., Wu, Z., & Ding, X. (2020). Embracing Imperfect Datasets: A Review of Deep Learning Solutions for Medical Image Segmentation. Medical Image Analysis. http://arxiv.org/abs/1908.10454; Tang, Y., Cai, J., Yan, K., Huang, L., Xie, G., Xiao, J., Lu, J., Lin, G., & Lu, L. (2021). Weakly-Supervised Universal Lesion Segmentation with Regional Level Set Loss. MICCAI. http://arxiv.org/abs/2105.01218
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The results are convincing. However, the paper will be more impactful if the authors can release the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    For Tables 2-4, the results will be easier to interpret if they are shown in line charts or bar plots.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clearly written. The conclusions are useful and convincing. However, the weakly-supervised segmentation algorithm can be improved.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This is quite a good paper that analyzes the cost of annotations for two tasks of chest X-ray analysis, classification and segmentation, along three dimensions: annotation granularity, annotation quality, and annotation quantity. The experiments are extensive and convincing, with several interesting findings that could inspire further studies in the same field.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper extensively analyzes (two tasks, four different kinds of annotations, dozens of different settings) the effects of different annotations for chest X-ray analysis, with interesting findings; this is a practical evaluation.
    2. All models are run three times with mean and std reported. I quite appreciate this point; it makes the results quite convincing.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. When comparing the segmentation models, is the ground truth always contours? If that is the case, and if the authors are also using the same Dice + BCE loss during training for all models, then using bounding boxes can also be regarded as using noisy labels for the segmentation task, as the authors have also mentioned in the paper. I think the comparison in Table 4 is mainly about the impact of quality, not granularity. (A minimal sketch of such a Dice + BCE loss is given after this list.)
    2. Some details and discussion could be added to improve the paper.
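
For readers less familiar with the loss mentioned above, below is a minimal sketch of a combined Dice + BCE segmentation loss in PyTorch. The function name, the equal weighting of the two terms, and the reduction are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, targets, eps=1e-6):
    """Equal-weight Dice + BCE loss for binary segmentation (illustrative sketch).

    logits:  raw model outputs, shape (B, 1, H, W)
    targets: binary ground-truth masks, shape (B, 1, H, W), float in {0, 1}
    """
    # Pixel-wise binary cross-entropy on the raw logits
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    # Soft Dice computed on the sigmoid probabilities, per sample
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + eps) / (union + eps)
    return bce + (1.0 - dice).mean()
```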
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Would be good if the authors release the code as they claimed in the list.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Fig.1 contains many symbols, and it’s hard to read separately from the main text. It’s recommended that the authors add necessary explanations to all the symbols in the figure for better readability.
    2. In the Learning Objective, L_{cls} and L_{seg} use the same symbols for the prediction and the GT. Please distinguish the different outputs and GTs.
    3. Please add a brief explanation for L_{cls} in Eq.2.
    4. Table 2: why do the authors say that the models trained with fewer than 12K training samples in total all performed similarly? There are clear differences between the models’ performance, as shown in the table under different dataset sizes. Also, it is recommended that the authors replace “Dataset Size” with “Training Dataset Size” in the table.
    5. In “Improving Performance by Mixing Levels of Granularity”, the authors mainly describe the observations without further explanation. It would be appreciated if more discussion or explanation of the observations were provided.
    6. Fig. 2: it’s recommended that the authors also put the results of using pure bounding boxes in this plot and discuss why that performance is the worst.
    7. There are a few works closely related to this paper. I would appreciate it if the authors could discuss the differences in findings between the following works and theirs: [a] Zlateski, et al. “On the importance of label quality for semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. [b] Luo, et al. “Rethinking annotation granularity for overcoming deep shortcut learning: A retrospective study on chest radiographs.” arXiv preprint arXiv:2104.10553 (2021).
    8. If there is an extension of this paper, I would appreciate seeing: (a) a quantitative analysis of the cost of annotations under different scenarios; (b) the best strategy when the annotation budget is limited. The authors can also refer to: [c] Ren, Zhongzheng, et al. “UFO2: A Unified Framework Towards Omni-supervised Object Detection.” European Conference on Computer Vision. Springer, Cham, 2020.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the experiments are extensive, and the results are convincing and interesting.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper investigates the role of quantity vs. quality of annotations for medical image classification. The authors have collected an impressive private dataset of chest X-rays which is close in size to CheXpert (120K vs. 224K images) but is manually annotated by radiologists. In addition, pixel-level annotations are provided. Specifically, different architectures are trained for chest X-ray classification and segmentation with different ground truths: expert labels (gold standard), expert labels + random noise, expert labels + segmentation, expert labels + bounding boxes. Their key take-away points are that i) noisy labels affect performance especially when training on smaller datasets, ii) providing lesion-level annotations improves performance at all scales, and iii) mixing a small number of gold-standard annotations with a larger dataset of imprecise annotations can improve performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses one of the most important aspects of training deep learning models in the medical domain: how to properly budget for the cost of annotation
    • The paper investigates not only the issue of noisy labels, but also the added benefit of providing lesion-level annotations. Previous studies in the literature, such as the mammography DREAM challenge, stressed that training from image-level labels alone results in lower performance. The present paper, however, explores the topic in a more quantitative and systematic fashion
    • The paper is overall well written and the experiments comprehensive
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main weakness is the noise model: the authors generate errors randomly at a given F1 score, whereas it is plausible that the errors of NLP labelers, such as the one used in CheXpert, are not randomly distributed across classes. For instance, in CheXpert the percentage of uncertain labels is not evenly distributed. The chosen F1 score (0.8) also appears low compared to the F1 scores reported in the CheXpert paper. (A hedged sketch of one way random noise could be injected at a target F1 follows this list.)
    • Statistical analysis is not conducted to verify whether differences are statistically significant
    • Some parts of the methodology are not clearly described – in particular the annotation process and some aspects of the training methods.
    • Related works are not discussed.
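
The paper's exact noise-generation procedure is not detailed in this review; as a point of reference, the following is a hedged sketch of one way random, class-independent label flips could be calibrated so that the noisy labels reach a target F1 (e.g. 0.8) against the clean labels. The function name and the balanced false-positive/false-negative scheme are assumptions, not the authors' method.

```python
import numpy as np

def inject_label_noise(labels, target_f1, seed=0):
    """Flip binary labels so the noisy labels score ~target_f1 against the clean ones.

    Flipping positives to negatives at rate a gives recall = 1 - a; choosing the
    negative-to-positive rate b so that false positives equal false negatives in
    expectation also gives precision = 1 - a, hence F1 ~= 1 - a = target_f1.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).astype(int)
    prevalence = labels.mean()
    a = 1.0 - target_f1                          # positive -> negative flip rate
    b = prevalence * a / (1.0 - prevalence)      # negative -> positive flip rate
    noisy = labels.copy()
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    noisy[pos[rng.random(len(pos)) < a]] = 0
    noisy[neg[rng.random(len(neg)) < b]] = 1
    return noisy
```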
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset is unfortunately private, but the methodology is clear and key hyper-parameters are reported. The authors will release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In this paper, the authors investigate the role played by granularity, quantity, and quality of the annotations and conduct a series of experiments to identify the best trade-off between accuracy and annotation cost. A common trend in the literature is to exploit NLP to automatically extract labels at scale: this strategy is currently exploited by most state-of-the-art public datasets in chest X-ray classification (CheXpert, ChestX-ray14). It is cost-effective but inevitably leads to a certain level of noise, despite advances in the NLP labelers. The results are, to the best of my knowledge, in qualitative agreement with previous medical and general computer vision literature.

    The most interesting contribution, in my opinion, is how to mix image-level and pixel-level annotations for both segmentation and classification, showing that lesion-level annotations can substantially improve performance regardless of the training set size (at least within the range considered in the paper, from 1.2K to 121K). The impact of noise is, in my view, less relevant, since the noise model considered may be too simple. The practical importance of the findings would be greatly improved if the noise model were backed up by an analysis of the actual errors made by an NLP labeler such as the one used in CheXpert (https://github.com/stanfordmlgroup/chexpert-labeler). Non-random, structured, or feature-dependent noise is likely to have a far greater impact than postulated (https://arxiv.org/pdf/2003.10471.pdf). Previous studies postulate that, in CheXpert, the ‘No finding’ class is the noisiest one (https://arxiv.org/pdf/2103.04053.pdf). This limitation is appropriately addressed by the authors in their conclusions. A few aspects should be further clarified:

    • Are the networks used in the experiments pre-trained on ImageNet?
    • The notation used in Section 2.1 (learning objective) is not very clear, as y^n is used to indicate both the class-level and the pixel-level predictions. It would be clearer to differentiate the notations
    • How is the segmentation network trained from image-level annotations only, especially in the experiments depicted in Fig.2 in which different images had different label granularity?
    • How were the types of abnormalities selected for annotation? Fewer categories are used than in CheXpert or other benchmarks: is there a particular reason?
    • In Table 1, it is interesting to note that the distribution of different types of findings is largely different from CheXpert; for instance, consolidation appears in 35% of the images, whereas in CheXpert it appears in <7%. I wonder whether these differences are due to the population or to the labeling method.
    • In Table 1, I would also report the number of cases with no findings
    • In Table 2, the authors report the average of three independent runs. I would specify whether data subsampling was repeated for each round. Given the level of data imbalance, it is likely that randomly selecting a small subset of the training set will result in an insufficient number of samples for the less frequent classes.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a fundamental practical problem and, while the results are empirical, they provide several pointers for practitioners to reduce annotation costs. The proposed strategies could be easily replicated for other pathologies, although the same performance benefits may not be observed.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received three consistent and detailed accept recommendations, and the AC checked the overall quality of this submission. This paper is recommended for Provisional Accept. Please adequately address all constructive comments and suggestions from the three reviewers in the final version, including and beyond:

    “1. Fig.1 contains many symbols, and it’s hard to read separately from the main text. It’s recommended that the authors add necessary explanations to all the symbols in the figure for better readability.

    2. In the Learning Objective, L_{cls} and L_{seg} use the same symbols for the prediction and the GT. Please distinguish the different outputs and GTs.
    3. Please add a brief explanation for L_{cls} in Eq.2.
    4. Table 2: why do the authors say that the models trained with fewer than 12K training samples in total all performed similarly? There are clear differences between the models’ performance, as shown in the table under different dataset sizes. Also, it is recommended that the authors replace “Dataset Size” with “Training Dataset Size” in the table.
    5. In “Improving Performance by Mixing Levels of Granularity”, the authors mainly describe the observations without further explanation. It would be appreciated if more discussion or explanation of the observations were provided.
    6. Fig. 2: it’s recommended that the authors also put the results of using pure bounding boxes in this plot and discuss why that performance is the worst.
    7. There are a few works closely related to this paper. I would appreciate it if the authors could discuss the differences in findings between the following works and theirs: [a] Zlateski, et al. “On the importance of label quality for semantic segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. [b] Luo, et al. “Rethinking annotation granularity for overcoming deep shortcut learning: A retrospective study on chest radiographs.” arXiv preprint arXiv:2104.10553 (2021).
    8. If there is an extension of this paper, I would appreciate seeing: (a) a quantitative analysis of the cost of annotations under different scenarios; (b) the best strategy when the annotation budget is limited. The authors can also refer to: [c] Ren, Zhongzheng, et al. “UFO2: A Unified Framework Towards Omni-supervised Object Detection.” European Conference on Computer Vision. Springer, Cham, 2020.”
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We would like to thank the reviewers for their thorough review of our work. We are delighted that the reviewers found our manuscript clearly written (R-1), the experimental results convincing (R-1, R-2), and our findings regarding the annotation cost of building CAD systems for chest radiographs interesting and useful (R-1, R-3). We address the comments and questions from the reviewers below, referring to Reviewer #N as R-N.

(R-2) Comments regarding results of Table 2 and clarification on our observations.

Our observation is that the classification models trained with different levels of label granularity perform similarly given the same amount of available training data. For example, the AUROC values across the four models trained at 6K samples with different forms of ground-truth labels are within the confidence bounds of one another. We agree that we can improve the clarity of the text in Section 3.2 of the original manuscript to prevent the possible confusion raised by R-2.
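
The exact criterion for "within the confidence bounds" is not spelled out here; one common reading is to compare mean ± std AUROC over the three runs, as in the sketch below. The helper name and the use of scikit-learn are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_mean_std(y_true, per_run_scores):
    """AUROC mean and std across independent training runs (illustrative sketch).

    y_true:         array of shape (N,) with binary labels
    per_run_scores: list of (N,) arrays of predicted scores, one per run
    """
    aurocs = np.array([roc_auc_score(y_true, s) for s in per_run_scores])
    return aurocs.mean(), aurocs.std()

# Two models could then be called "similar" if their mean +/- std intervals overlap.
```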

(R-2, R-3) Confusion regarding notations used in Figure 1 and Equation 2.

As edits to the figure will not take up extra space, we will include explanations for all the symbols in the figure itself to improve readability. We will also distinguish the notations for the prediction and the GT in Equation 2.
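
For illustration only, one way to disambiguate the symbols is sketched below; the weighting term lambda, the hat notation for predictions, and the use of S for pixel-level masks are assumptions, not the paper's actual Eq. 2.

```latex
\mathcal{L} \;=\;
\underbrace{\mathcal{L}_{\mathrm{cls}}\bigl(\hat{y}^{(n)},\, y^{(n)}\bigr)}_{\text{image-level prediction vs. label}}
\;+\;
\lambda\,\underbrace{\mathcal{L}_{\mathrm{seg}}\bigl(\hat{S}^{(n)},\, S^{(n)}\bigr)}_{\text{pixel-level prediction vs. mask}}
```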

(R-1, R-2) Discussion regarding the poor performance of using bounding boxes instead of contours for training segmentation models.

As we pointed out in the paper, and as also mentioned by R-2, our intuition regarding the bounding box annotations is that they are noisier versions of the contour ground-truth labels. The lower granularity and the added label noise of bounding boxes both contribute to the lower performance of the resulting segmentation model.
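
For concreteness, below is a minimal sketch of rasterizing bounding boxes into a binary mask, i.e. one way such "noisy" pixel labels could be produced; the function name and coordinate convention are assumptions, and the conversion used in the paper may differ.

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Rasterize bounding boxes into a binary segmentation mask (illustrative sketch).

    boxes: iterable of (x_min, y_min, x_max, y_max) pixel coordinates
    Returns a (height, width) uint8 mask with 1 inside any box, 0 elsewhere.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    for x_min, y_min, x_max, y_max in boxes:
        mask[int(y_min):int(y_max), int(x_min):int(x_max)] = 1
    return mask
```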

(R-2) Discussion regarding improving performance by mixing levels of granularity

As suggested by R-2, we plan to further investigate how our findings translate into the real cost of building CAD systems. Instead of comparing different granularities of labels under a constant budget of total annotation duration (in seconds), as in [a], we intend to relate CAD performance to real monetary cost by taking into account the market price of acquiring annotations from radiologists. In this way, we intend to provide useful guidance to practitioners building CAD systems for chest radiographs.

[a] Ren, Zhongzheng, et al.

(R-2, R-3) Discussion on how our findings relate to the literature

The main observation from [b] is that competitive performance can be obtained by using a large amount of coarsely labeled data together with a small number of finely annotated images when training segmentation models. The main conclusion of [b] is that it is more cost-effective to focus on generating large amounts of coarsely labeled examples than to spend the same amount of time labeling precisely at the pixel level. The findings of [b] are consistent with our results reported in Figure 2. Moreover, the results further corroborate our narrative that investing time to reduce label noise is worthwhile for producing high-performing classification/segmentation models in a cost-effective manner.

The key takeaway from [c] is that a model trained using only image-level annotations (CheXNet) shows worse generalization performance than a model trained using localization ground truth in the form of bounding boxes (CheXDet) when tested on images from new institutions. One key difference between CheXNet and our classification model is the source of the classification ground truth. To the best of our knowledge, CheXNet is trained using image-level annotations extracted from associated text reports, whereas our annotations are obtained from radiologists. The labels of the test sets in [c] come from radiologists, and the discrepancy between the labeling methods used in training and testing may contribute to the generalization performance gap reported in [c].

[b] Zlateski, et al. [c] Luo, et al.


