
Authors

Güinther Saibro, Michele Diana, Benoît Sauer, Jacques Marescaux, Alexandre Hostettler, Toby Collins

Abstract

A common difficulty in computer-assisted diagnosis is acquiring accurate and representative labeled data, which is required to train, test and monitor models. For liver steatosis detection in ultrasound (US) images, labeling images with human annotators can be error-prone because of subjectivity and decision boundary biases. To overcome these limitations, we propose comparative visual labeling (CVL), where an annotator labels the relative degree of a pathology in image pairs; these comparisons are combined with a RankNet to produce per-image diagnostic scores. In a multi-annotator evaluation on a public steatosis dataset, CVL+RankNet significantly improves label quality compared to conventional single-image visual labeling (SVL) (0.97 versus 0.87 F1-score respectively, 95% CI significance). This is the first application of CVL for diagnostic medical image labeling, and it may stimulate further research on other diagnostic labeling tasks. We also show that Deep Learning (DL) models trained with CVL+RankNet or histopathology labels attain similar performance. Our code and data will be made publicly available.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_39

SharedIt: https://rdcu.be/cVRtp

Link to the code repository

https://github.com/IRCAD/cvl

Link to the dataset(s)

https://www.ircad.fr/research/data-sets/


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a comparative visual labeling (CVL) + RankNet approach to develop comparative and reliable labels for training and testing computer aided diagnostic systems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This is the first time CVL is applied to diagnostic labeling of medical images.
    2. It provides a reliable way of generating labels for medical diagnostic tasks, which are often subjective and difficult to obtain.
    3. The authors show that deep learning models trained with these labels achieve similar performance to histopathology labels.
    4. The paper is well-written with nice, illustrative figures.
    5. There are sufficient experiments to support the findings.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are some typographical errors, e.g. page 6 “A drawbacks…”. Also, some abbreviations are used without being explicitly defined, e.g., DL in the abstract; it may not be evident to all readers that DL means ‘deep learning’.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible. Code and data will be made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please do a thorough proof-reading of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My recommendation is based on the novelty of the task and its immense applicability in medical imaging, where generating good-quality labels is often difficult and impedes machine learning training. The paper is very well written and can stimulate further research on applying comparative visual labeling to medical diagnostic labeling tasks.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Authors provide reasonable justifications to reviewers’ comments. My initial decision holds.



Review #2

  • Please describe the contribution of the paper

    This is an interesting study on fatty liver disease diagnosis in ultrasound. The authors investigated problems related to the lack of proper reference labels for the development of fatty liver disease diagnosis methods. To address this problem, comparative visual labeling (CVL) together with the RankNet method was used to improve the quality of the labels determined by three annotators. Moreover, the authors trained a CNN with the differently obtained labels to classify fatty livers and thereby assess label quality.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors address an important problem related to fatty liver disease diagnosis. Sometimes reference labels for fatty liver disease diagnosis are not available; in this case, radiologists assess the US images to provide the labels. A robust approach to image labeling would therefore be very useful.

    • The authors used the RankNet algorithm in an interesting and novel way, and compared the proposed approach with the conventional method.

    • Three annotators participated in the study. Table 1 suggests that the proposed method improved the quality of the labels to some extent.

    • Good reproducibility: the authors plan to release the code and the dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The technical novelty of the manuscript is limited; the authors used well-known techniques for the ranking (a RankNet implemented as a feed-forward network) and for the image classification (InceptionResNetV2 pre-trained on ImageNet).

    • The proposed method is more time-consuming than conventional labeling. Its usefulness would probably be limited for a larger dataset.

    • The authors performed multiple experiments to calculate various cut-offs and performance metrics and to determine hyper-parameters. This gives the impression that an independent test set was not used for the overall method evaluation.

    • Experiments were performed on relatively small datasets, which makes it difficult to assess the usefulness of the methods (see the next two comments).

    • “Two classification CNNs were trained on Dataset 2 using SVL and CVL+RankNet labels, and tested on Dataset 1 … ROC-AUCs were 0.89 (CVL+RankNet) and 0.86 (SVL), and the difference was not statistically significant (p = 0.34)” It seems that the classification performance did not significantly increase thanks to the proposed method. The same issue is associated with the results presented in Table 2.

    • According to McNemar’s test and Table 1, the proposed method did not significantly improve the F1 scores for 2 out of 3 annotators. It seems that the labels obtained with the conventional approach were already of good quality. For a dataset of 50 cases, a 10% drop in accuracy corresponds to 5 misclassified examples, which is, after all, a small number. This suggests that the proposed method improved the labels for only a few cases compared to the conventional approach.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility. The authors plan to release the code and the dataset.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The general idea of the work is very interesting.

    • It would be interesting to evaluate the proposed approach on several datasets from different medical imaging modalities.

    • Evaluations on a larger liver US image dataset would probably better highlight the usefulness of the proposed approach.

    • The authors separated the dataset into four groups for the evaluations (Fig. 3). I think it would be interesting to directly relate the error rate to the liver steatosis level. The mild group included cases with a steatosis level between 5% and 33%; in practice, I would expect labeling errors mostly for cases with a steatosis level between 5% and 10%.

    • Since the radiologists need to assess pairs of liver US images, the proposed method is more complicated and time-consuming than the conventional approach. The results presented in Table 2 show that the proposed labeling method does not improve the classification performance. Unfortunately, this suggests that the conventional method would probably be preferable in this case.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors proposed an interesting approach to the labeling problem in fatty liver disease diagnosis. However, the usefulness of the proposed method was evaluated on relatively small datasets, which makes the results difficult to assess. Moreover, its usefulness might be limited in practice for larger datasets due to the requirement to generate image pairs.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I would like to thank the authors for the detailed response to my comments. I believe that this work presents an interesting approach to the labeling problem, but I would still vote for weak reject. The main issues are that the study was performed on a small dataset and that the proposed method did not improve the classification performance (Table 2); therefore, I still think the overall usefulness of the proposed method is difficult to assess.

    In the rebuttal letter, the authors wrote that the proposed method can address some disadvantages of the standard labeling (SL) technique (1 iii and iv) and that it can improve intra-annotator agreement. However, I am not sure if this is the case. In the study, the annotators could have labeled the images several times following the standard approach; this would have enabled the authors to calculate intra-annotator agreement levels and confidence scores for each image. The authors state that there was a consistent improvement over SL with the number of pairwise comparisons, but it is unclear whether the performance of the SL technique would not also significantly increase with subsequent relabelings (which would, I guess, still require less work from the annotators than many pairwise comparisons).



Review #3

  • Please describe the contribution of the paper

    The authors propose to use a RankNet to improve the healthy/pathological labels for steatosis detection in ultrasound images. The inputs to the RankNet are randomly selected image pairs from the dataset, and it is trained on binary labels provided by the annotators indicating whether the first or second image shows the higher degree of pathology. The scores generated by the RankNet are thresholded to obtain labels, whose quality is evaluated against histopathology results, outperforming the visual labels provided by the annotators. The new labels did not enhance the classification performance of the CNNs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-organized and easy to follow. The authors present an innovative idea of using RankNets for their specific application of improving label quality in steatosis detection, which can be extended to other uses of biomedical images for disease detection.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    According to Figure 2, what is fed to the RankNet are one-hot encodings of the image indices, not the image data themselves, and the network produces the scores based on these one-hot encodings. I believe the authors should elaborate on this, to clarify whether this is really the case and, if it is, how it leads to better labels (see the illustrative sketch at the end of this answer).

    Both datasets used by the authors are very small, and since the authors have not presented any information about data stratification, I believe the networks are trained on very few samples, which, in my view, affects the credibility of the results. The authors should state whether they augmented the dataset, or whether they trained and tested on the same dataset (for the results of Table 2, for example), which they should not have!
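
    To illustrate the point about one-hot inputs, here is a minimal, hypothetical sketch (not the authors' code) of a RankNet that scores images purely from one-hot index encodings and is trained on pairwise comparisons; the dataset size, network width, hyper-parameters and the choice of Keras are assumptions for illustration only.

```python
# Illustrative sketch only (not the authors' implementation): a RankNet whose
# inputs are one-hot encodings of image indices, trained on pairwise
# comparisons, producing a per-image score that can be thresholded into labels.
import numpy as np
import tensorflow as tf

n_images = 55                                   # hypothetical dataset size
one_hot = np.eye(n_images, dtype="float32")     # row i = one-hot code of image i

# Feed-forward scorer f: one-hot index -> scalar latent score.
scorer = tf.keras.Sequential([
    tf.keras.Input(shape=(n_images,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Pairwise RankNet head: P(image i more pathological than image j)
# = sigmoid(f(i) - f(j)), trained with binary cross-entropy.
in_i = tf.keras.Input(shape=(n_images,))
in_j = tf.keras.Input(shape=(n_images,))
diff = tf.keras.layers.Subtract()([scorer(in_i), scorer(in_j)])
prob = tf.keras.layers.Activation("sigmoid")(diff)
ranknet = tf.keras.Model([in_i, in_j], prob)
ranknet.compile(optimizer="adam", loss="binary_crossentropy")

# Hypothetical annotator comparisons: (i, j, y), y=1 if image i was judged
# more pathological than image j, else 0.
pairs = [(0, 1, 1), (2, 3, 0), (4, 0, 1)]
i, j, y = (np.array(v) for v in zip(*pairs))
ranknet.fit([one_hot[i], one_hot[j]], y.astype("float32").reshape(-1, 1),
            epochs=200, verbose=0)

# Per-image diagnostic scores, to be thresholded into healthy/pathological labels.
scores = scorer.predict(one_hot, verbose=0).ravel()
```

    With one-hot inputs the network cannot generalize to images outside the compared set; it effectively learns one latent score per labeled image from the comparisons alone, which would be consistent with the reviewer's reading of Fig. 2 and with a separate CNN subsequently being trained on the image data using the derived labels.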

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    One of the datasets used in this paper is public, and the information provided about the network structure and hyper-parameters is enough to reproduce the code. The authors have also mentioned that they will make the code public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I believe the text should be further edited. For example, in the abstract the authors do not need to mention SVL because it is never used again in that section, and the sentence “Code and data will be made publicly available.” should probably be moved to after the data and models are described in the Methods section. The acronym inside the parentheses in line 12 of the second paragraph of page 2 should be SVL; I also believe that sentence is missing a verb. In the following line it should be surgical “skill” assessment. And …

    For this sentence in the results section, “This indicates that the CNNs have some inherent robustness to training label errors in this task.”, the authors should either provide a justification or a reference.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The use of the RankNet for this application is innovative, but I think the authors should provide more information on the type of inputs to this network and on the way its performance was evaluated.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose a comparative visual labeling approach to develop comparative and reliable labels for training and testing. The application is very interesting, but the novelty is limited and the approach is evaluated on relatively small datasets. Please address the reviewers’ points concerning the experimental setup. Please remember that the purpose of the rebuttal is to provide clarification or to point out misunderstandings, not to promise additional experiments.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR




Author Feedback

We are very grateful for the feedback; our responses to the criticisms are below.

1) Limited dataset size (R2, R3) and McNemar significance (R2). We used the only public dataset with ground truth (biopsy), also used in [6, 18, 28]. Indeed, avoiding the ethical, practical, and economic barriers to assembling such data is a primary motivation. The McNemar test passed for 1 annotator and for the fused labels (alpha=0.05), the latter being important as it represents label quality using all information; the paired F1 difference test passed for 2 annotators and for the fused labels. p-values above 0.05 do not prove the absence of an effect, and despite the dataset size, the results taken as a whole give good evidence to support the method. Compared to standard labeling (SL):
i) Label quality improved for every annotator (Table 1). Performance also improved (0.87 vs. 0.97 F1) compared to SL majority voting.
ii) Fleiss’ kappa (the degree of annotator agreement beyond that expected by chance) improved from 0.75 (‘substantial agreement’) to 0.84 (‘almost perfect agreement’) - to be added to the manuscript.
iii) Unlike SL, our method gives real-valued diagnostic labels. They correlated very well with (invasive) biopsy scores in mild (most clinically relevant) cases (Fig. 3b).
iv) Unlike SL, our method gives an annotator’s ROC curve (CVL-ROC). The AUCs were high for all annotators (0.99, 0.95 and 0.97). For each annotator, their SL performance was under their own CVL-ROC curve, and also under the CVL-ROC curves of all other annotators. We will add plots clearly showing this in the supplementary material.
v) The method requires more labels than SL, but only a small number P of pairwise comparisons is needed per image (Fig. 4). This shows a clear trend and a consistent improvement over SL with as few as P=3.
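
As a side note for readers, the agreement and significance statistics referenced above (Fleiss’ kappa, McNemar’s test) can be computed with standard tooling; the sketch below uses statsmodels with purely synthetic placeholder labels, not the study data.

```python
# Sketch of the agreement/significance statistics referenced above.
# The label arrays are random placeholders, not the study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
# Binary labels (0 = healthy, 1 = steatosis) from 3 annotators on 55 cases.
labels = rng.integers(0, 2, size=(55, 3))

table, _ = aggregate_raters(labels)        # cases x categories count table
print("Fleiss' kappa:", fleiss_kappa(table))

# McNemar's test on paired correctness of two labeling schemes (e.g. SVL vs.
# CVL+RankNet) against a reference such as biopsy.
svl_ok = rng.integers(0, 2, size=55).astype(bool)
cvl_ok = rng.integers(0, 2, size=55).astype(bool)
ct = [[np.sum(svl_ok & cvl_ok),  np.sum(svl_ok & ~cvl_ok)],
      [np.sum(~svl_ok & cvl_ok), np.sum(~svl_ok & ~cvl_ok)]]
print(mcnemar(ct, exact=True))             # exact binomial McNemar test
```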

We note that the CVL-ROCs have other important uses, including better inter- and intra-annotator performance assessment, and achieving a desired label sensitivity/specificity without annotators needing to relabel images. This would be useful, for example, in labeling mass malignancy potential, where an annotator’s SL decision may be strongly subjective. We believe there is enough novelty, supportive results, and potential impact to stimulate good interest in this new diagnostic labeling approach, for other pathologies, a larger evaluation, and improvements such as active learning for relevant pair selection (R2).

2) Lack of CNN improvement (R2). The CNN appeared robust to training-data labeling errors from SL, which explains the lack of improvement. Despite this, the method already has large potential utility: to accurately evaluate models under development, to monitor certified models, and to evaluate annotator performance with more accurate labels and CVL-ROCs.

3) Multiple experiments / thresholds (R2). The experiments were designed to analyze the method and label errors, not to calculate or tune cut-offs or hyper-parameters. The RankNet configuration was not tuned by us (a post-hoc analysis in the supplementary material showed very low sensitivity of performance to the configuration). Annotators did not receive any feedback about their selected RankNet thresholds during development. Thus, the validation of our central contribution (CVL+RankNet labeling) was well isolated from development. For the CNN, transfer learning with standard hyper-parameters and augmentations was used and tested with leave-one-out cross-validation (R3: the CNN was not trained and tested on the same images). A few CNN training parameters (the amount of augmentation) were tuned using biopsy labels, which was fair because a) they were not tuned with our labels, b) this was needed to match state-of-the-art performance from [6], and c) there is no external dataset with biopsy labels.
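
To make the evaluation protocol described above concrete, the sketch below shows one plausible way to set up transfer learning with an ImageNet-pretrained InceptionResNetV2 and leave-one-out cross-validation; the data, image size, frozen backbone and training settings are all assumptions, not our exact configuration.

```python
# Plausible sketch of the transfer-learning + leave-one-out protocol described
# above; data, shapes and hyper-parameters are placeholders (assumptions).
import numpy as np
import tensorflow as tf
from sklearn.model_selection import LeaveOneOut

images = np.random.rand(55, 299, 299, 3).astype("float32")  # placeholder US images
labels = np.random.randint(0, 2, size=(55, 1)).astype("float32")  # binary steatosis labels

def build_model():
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet", pooling="avg")
    base.trainable = False                                   # frozen ImageNet backbone
    out = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    model = tf.keras.Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

preds = np.zeros(len(labels))
for train_idx, test_idx in LeaveOneOut().split(images):      # one held-out image per fold
    model = build_model()
    model.fit(images[train_idx], labels[train_idx], epochs=1, verbose=0)
    preds[test_idx] = model.predict(images[test_idx], verbose=0).ravel()
# A ROC-AUC would then be computed over the pooled held-out predictions.
```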

4) Limited technical novelty (R2). The approach is technically simple (a plus), using existing networks from computer vision. This is common when the main contribution, as in our case, is an innovative application.

5) To clarify (R3): yes, the RankNet used one-hot vectors of the image indices as input.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Given the novelty of the task and its wide applicability, I think the paper merits acceptance. It is true that the dataset (55+54 cases) is small, but the only public dataset in this field has 55 cases. If the authors share their data, it will be an important contribution to the community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    After rebuttal, this submission stands at two accepts and one reject. The AC reviewed the paper, the reviews, and the authors’ rebuttal. This paper presents some genuinely interesting and potentially useful ideas, which have been validated in this work. Although some writing clarity and further experimental performance issues need to be addressed in a later version, the AC will vote to accept this paper for MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a fresh approach to resolving label ambiguities through comparative visual labeling. This occurs often enough in ground-truth labeling that improved approaches are essential: even with training, annotators are often unsure how to classify. Rating images comparatively with respect to other images, and then using the resulting pairwise ordering scores to derive an ordinal ranking for the labels, therefore seems a practical approach. If the code is made available, I can see it being adopted and tested on wider datasets that can throw more light on the validity of the technique. In fact, if a theory can be formed from it, it could be the next generation of contrastive learning approaches.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


