
Authors

Ting Liu, Xing An, Yanbo Liu, Yuxi Liu, Bin Lin, Runzhou Jiang, Wenlong Xu, Longfei Cong, Lei Zhu

Abstract

This paper presents a novel deep learning system that simultaneously classifies breast lesions in ultrasound images as benign or malignant and into the six Breast Imaging Reporting and Data System (BI-RADS) categories. A multitask soft-label generating architecture is proposed to improve classification performance, in which task-correlated labels are obtained from a dual-task teacher network and used to guide the training of a student model. In the student model, a consistency supervision mechanism is embedded to constrain the BI-RADS prediction to be consistent with the predicted pathology result. Moreover, a cross-class loss function that penalizes different degrees of misclassification with different weights is introduced to bring the BI-RADS prediction closer to the annotation. Experiments on our private dataset and two public datasets show that the proposed system outperforms current state-of-the-art methods, demonstrating its great potential in clinical diagnosis.
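
For illustration, here is a minimal PyTorch sketch of one way such a cross-class penalty can be written, weighting errors by their ordinal distance from the annotated BI-RADS category; the function and weighting are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def cross_class_loss(logits, target, num_classes=6):
    # Expected ordinal distance between the predicted BI-RADS distribution
    # and the annotated category: farther misclassifications cost more.
    probs = F.softmax(logits, dim=1)                           # (N, 6)
    classes = torch.arange(num_classes, device=logits.device)  # (6,)
    dist = (classes.unsqueeze(0) - target.unsqueeze(1)).abs().float()  # (N, 6)
    return (probs * dist).sum(dim=1).mean()

# Toy usage: 4 images, 6 BI-RADS categories.
loss = cross_class_loss(torch.randn(4, 6), torch.tensor([0, 2, 4, 5]))
```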

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_45

SharedIt: https://rdcu.be/cVRuw

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a new architecture for training a cancer-prediction model on breast ultrasound images. The architecture has several key elements: (1) dual-task prediction of the cancer label (pathology) and the BI-RADS score (radiologist assessment); (2) a teacher-student architecture, with a potentially novel way to derive soft labels from the teacher’s predictions; (3) a consistency loss on the cancer and BI-RADS predictions, as the two labels are usually correlated; (4) a cross-class loss on BI-RADS predictions: since the six BI-RADS classes are ordinal, the loss depends on the distance between the true class and the predicted class. Evaluation results include a comparison to single-task state-of-the-art architectures and an ablation study.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Adding BI-RADS as an additional task to the pathology prediction is novel, to the best of my knowledge, as is the consistency loss.
    2. In the presented results, the proposed architecture shows superior performance compared to other architectures.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are several parts of the paper that were unclear to me, such as the derivation of the soft labels, the handling of multiple images from the same patient, and the handling of inconsistencies between BI-RADS and pathology. See detailed comments below.
    2. It is unclear why different tests were done on the 3 datasets. The results would be much stronger if the proposed architecture showed superior results on all three datasets using the same tests. In other words, I would expect the comparison to the state of the art and the ablation study to be conducted on all three datasets used in this study.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Most of the presented methodology is clear and easy to reproduce. However, reproducing the results may be difficult due to the unclear parts I mentioned above, namely: (i) the derivation of the soft labels, (ii) the handling of multiple images from the same patient, (iii) the handling of inconsistencies between BI-RADS and pathology.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It is unclear how the authors handled cases in which the BI-RADS (radiologist’s assessment) and pathology (cancer/benign) disagree, i.e., cancer with a low BI-RADS, or benign with a high BI-RADS. In many cases a suspicious finding turns out to be benign, and mammography may reveal additional suspicious findings that are not visible on ultrasound (e.g., calcifications). Were such cases excluded from the training?
    2. How were the probabilities in Figure 1 computed? If they were taken from a paper, please add a citation to the reference in the caption of the figure.
    3. The computation of the soft labels is unclear:
       a. Equation 1 is unclear: what are the two summations over? Is the summation limited to a single patient, or does it involve all patients? What does the “while” mean? Is there any iteration involved?
       b. The following sentence is unclear: “N_i and N_j represent the total of the images being counted in corresponding sum formula.” What is the criterion for these images? Does N_j stand for N_0 and N_1, and are these the numbers of cancer and non-cancer patients or images, respectively?
       c. The task-correlated labels are then used to train the student model that has the same
    4. How did you handle the issue of multiple images from the same patient? Was the presented evaluation (AUC^P, AUC^B, and the other measures) performed on the set of images or on the set of unique patients? Did you perform any kind of aggregation over images from the same patient?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The two factors that affected my decision were the lack of clarity on important methodological issues and the inconsistent evaluation across the three datasets.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The paper presents a framework for classifying the BI-RADS score (malignancy risk stratification) as well as true (path-proven) malignancy of breast lesions from ultrasound. The framework uses a teacher-student model, where the teacher provides soft labels to the student. The student is also required to output consistent predictions for BI-RADS and path-malignancy (Consistency Supervision Mechanism), and is penalized more strongly for greater errors (Cross-Class Loss Function). The authors demonstrate this framework outperforms current approaches, which typically train a model to jointly classify BI-RADS and path-malignancy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Very easy to understand, with a reasonable model design. Great datasets. Comparisons and ablations are fairly thorough. Results show substantial improvements on many metrics.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Only minor weaknesses:

    • There are 3 different techniques used in this paper, each of which is reasonable, but none of which is particularly novel or explored in great depth. They also seem to be generally applicable to many multitask learning problems (beyond medical vision), so it would be interesting to know how they have been used elsewhere.
    • Clinical utility is somewhat unclear. Comparison to radiologist performance and analysis of failure cases would help on this front. It would be helpful to have a sense of how well radiologists can predict whether a lesion will be malignant, as well as inter-reader agreement. Is this framework also useful for mammograms?
    • I’m not familiar with multitask learning, but I suspect there are stronger baselines than just “train on both tasks simultaneously”.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code provided. 2 public datasets, 1 private.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • It would be nice to have an ablation for CSM + CCLF.
    • Consider renaming “Teacher” in Table 4 to “RepVGG-A2” to remind readers that it is the same as that row in Table 2.
    • How do you calculate AUC^B, since BI-RADS is multi-class?
    • What does [3] do to get better performance on AUC?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is a solid approach for classifying malignant breast lesions. Probably most useful for clinical researchers and clinicians, but other medical vision researchers who work on multiclass learning tasks can also draw inspiration from this approach. Weaknesses are fairly minor.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel use of soft labels to improve breast lesion malignancy prediction and BI-RADS category prediction. The soft labels are generated using a teacher network, which in turn trains a student network. Two loss functions are proposed: CSM to enforce consistency between the two tasks and CCLF to penalize large deviations in the BI-RADS class.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and has good experimental validation of its claims. The proposed approach shows good improvements over other baselines and can have broader applicability to other domains. An ablation study showing the gains from the proposed loss functions and the student-teacher method is provided. The CAM analysis shows that the intended benefits of the loss functions are observed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The equations of the proposed loss functions are not clearly specified, leading to the following confusions:

    In equation 1, it is unclear whether SLB and SLP are calculated per lesion. It is confusing to read, as the equations depict the soft labels as a function of neither the lesion nor the image. This needs to be clearly stated.

    In equation 3, doesn’t the loss always sum to a negative value, since the terms are all probabilities in the interval [0, 1]? It would be more accurate to call it a reward than a loss.

    In equation 1, it is unclear why N_i and N_j are different; shouldn’t they be equal? This suggests that images of the same lesion have different BI-RADS scores?

    In equation 3, it needs to be clarified that the benign and malignant probabilities sum to one, i.e., \(p_B + p_M = 1\), as the check for benign is \(p_B > 0.5\).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper states the relevant hyperparameters and presents results on 2 public datasets, though the code and in-house dataset are not released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The average number of images per lesion should be mentioned.

    In the “Implementation Details” section, the authors state that the images are resized to 224x224 but that the “aspect ratios remained to unify input size”. This statement needs more clarity, as it is not evident how the aspect ratio is maintained while resizing to a fixed size.

    In Fig-5, the CAM method used should be cited.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel, improves over baselines and has sound experimental validation. The results are validated on public datasets and ablation is conducted to show the utility of each component of the method.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors introduce technology to address risk stratification of ultrasound images of breasts with lesions. It is a well-written paper.

    Their proposed technology, based on coupled deep learning systems, is novel. The authors perform experiments on three different datasets and claim to generate better results than competing technologies. As this is a key claim in support of the importance of the proposed novel method, it is important to provide all details related to the experiments and their results.

    As reflected in the reviewers’ comments, it seems that several implementation details are not clear enough and require your attention and clarification. Let me add questions regarding the comparison of performance between different algorithms. What are the confidence intervals of the reported results (e.g., in Tables 2 and 3)? What statistical tests have you performed to confirm that your proposed algorithm’s outperformance (where the algorithm achieved those results) is not within the noise?

    Please share those details in your reply to this review, and react to the other comments of the reviewers; it will help in assessing the significance of the results.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8 (somewhere in the middle)




Author Feedback

Thanks for the valuable comments and suggestions! Our replies are as follows:

Meta-Reviewer #What are the confidence intervals (CIs)? The 95% CIs of the last row in Table 2 are: |0.95-0.97|0.88-0.91|0.84-0.89|0.90-0.94|0.87-0.92|0.88-0.92|0.92-0.94|0.65-0.69|. The figures for Table 3, row by row and in the same order, are:
|0.91-0.96|*******|0.87-0.94|0.83-0.89|0.60-0.75|0.93-0.97|
|no report|0.88-0.92|0.83-0.88|0.67-0.79|0.89-0.94|0.75-0.85|0.86-0.90|0.86-0.91|
|0.84-0.94|0.87-0.97|
|0.83-0.95|0.79-0.89|0.77-0.93|0.78-0.89|
|no report|0.85-0.95|0.82-0.92|0.54-0.81|0.92-0.99|0.80-0.97|0.81-0.90|0.86-0.95|

#What statistical tests were performed? The DeLong test was applied. In Table 2, our method was better than the others with p<0.0001 in terms of both AUC_P and AUC_B. In Table 4, the result improved after adding each component, with p-values as below:

Row | AUC_P | AUC_B
2nd | 0.0318 | 0.1186
3rd | 0.0160 | <0.0001
4th | <0.0001 | 0.0991

The methods compared in Table 3 did not report the detailed information needed for such analysis, so none was performed for them.
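
For readers who want to reproduce such a comparison: the DeLong test has no one-line call in common Python libraries, so below is a hedged stand-in using a paired bootstrap over the test set. All names are illustrative, and this is not the exact procedure used in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_test(y, score_a, score_b, n_boot=2000, seed=0):
    """Two-sided bootstrap test for a difference in AUC between two models,
    as a stand-in for the DeLong test."""
    rng = np.random.default_rng(seed)
    y, score_a, score_b = map(np.asarray, (y, score_a, score_b))
    obs = roc_auc_score(y, score_a) - roc_auc_score(y, score_b)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():
            continue  # skip resamples that contain only one class
        diffs.append(roc_auc_score(y[idx], score_a[idx]) -
                     roc_auc_score(y[idx], score_b[idx]))
    diffs = np.asarray(diffs)
    # Center the bootstrap distribution at zero to approximate the null.
    p = float(np.mean(np.abs(diffs - diffs.mean()) >= abs(obs)))
    return obs, p
```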

Reviewer 1 & 3 #Unclear about the computation of the soft labels. In equation 1, i in [0,5] denotes the BI-RADS category and j in {0,1} denotes benign or malignant. x_ij represents an input image annotated with the ith BI-RADS category and the jth pathology category. We ran the trained teacher model on the training set and obtained the predicted probability vectors of BI-RADS (tb’(x_ij)) and pathology (tp’(x_ij)). The soft labels of BI-RADS and pathology were derived separately. To compute the soft label of BI-RADS i (SLB_i), a predicted BI-RADS probability vector is summed if the predicted BI-RADS result is i (tbc’(x_ij)=i) and the pathology result equals the annotation (tpc’(x_ij)=j, j in [0,1]). ‘While’ means including all qualified cases; N_i is the number of qualified images. To calculate the soft label of pathology j (SLP_j), a predicted pathology probability vector is summed if the predicted pathology result is j (tpc’(x_ij)=j) and the BI-RADS result equals the annotation (tbc’(x_ij)=i, i in [0,5]). N_j is also the number of qualified images, but it differs from N_i because the condition is different.
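
A minimal NumPy sketch of the soft-label derivation as described above; the array names are hypothetical, and the paper's Equation 1 may normalize differently:

```python
import numpy as np

def soft_labels(tb_prob, tp_prob, birads_gt, path_gt):
    """tb_prob:   (N, 6) teacher BI-RADS probability vectors tb'(x_ij)
    tp_prob:   (N, 2) teacher pathology probability vectors tp'(x_ij)
    birads_gt: (N,)   annotated BI-RADS category i in [0, 5]
    path_gt:   (N,)   annotated pathology label j (0 benign, 1 malignant)
    """
    tb_pred = tb_prob.argmax(axis=1)  # tbc'(x_ij)
    tp_pred = tp_prob.argmax(axis=1)  # tpc'(x_ij)

    slb = np.zeros((6, 6))
    for i in range(6):
        # Images annotated i, with BI-RADS predicted as i and a correct
        # pathology prediction; N_i is the number of such images.
        qual = (birads_gt == i) & (tb_pred == i) & (tp_pred == path_gt)
        if qual.any():
            slb[i] = tb_prob[qual].sum(axis=0) / qual.sum()

    slp = np.zeros((2, 2))
    for j in range(2):
        # Images annotated j, with pathology predicted as j and a correct
        # BI-RADS prediction; N_j is the number of such images.
        qual = (path_gt == j) & (tp_pred == j) & (tb_pred == birads_gt)
        if qual.any():
            slp[j] = tp_prob[qual].sum(axis=0) / qual.sum()

    return slb, slp
```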

Reviewer 1 #Why were different tests done? The comparison to the state of the art (Table 2) and the ablation study (Table 4) on our dataset aimed to prove the effectiveness of our method and of each component. The test on the public datasets (Table 3) was to make our results more convincing. We could not guarantee correctly re-implementing the other methods in Table 3, so no comparison with them on our dataset was done. We did not present an ablation test on the public datasets because we thought the current results were adequate and the paper length was limited.

#How to handle inconsistent annotations? First, the quality of our BI-RADS annotations is high (AUC=0.974). Second, the CSM constrains the BI-RADS prediction to be consistent with the pathology prediction (details on page 4), which could address this issue to some extent. Hence, we did not specifically handle these cases.
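
A sketch in the spirit of the CSM, following the corrected log(1 - sum(b'(x))) term in the reply to Reviewer 3 below; the split of BI-RADS categories into benign- vs. malignant-leaning (malignant_idx) and the exact functional form are assumptions, not the paper's equation:

```python
import torch

def consistency_loss(birads_probs, path_probs, malignant_idx=(3, 4, 5)):
    # birads_probs: (N, 6) predicted BI-RADS distributions b'(x)
    # path_probs:   (N, 2) predicted pathology distributions (benign, malignant)
    p_benign = path_probs[:, 0]
    high_mass = birads_probs[:, list(malignant_idx)].sum(dim=1)
    # When the model predicts benign (p_B > 0.5), penalize BI-RADS probability
    # mass on malignant-leaning categories via -log(1 - sum(b'(x))).
    benign_pred = (p_benign > 0.5).float()
    return (-torch.log((1.0 - high_mass).clamp(min=1e-7)) * benign_pred).mean()
```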

#How to handle multiple images from the same patient? The evaluation was performed on images, but the dataset was divided at the patient level (details on page 5) to avoid data leakage. All images were treated as independent samples; no aggregation was involved.
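
A patient-level split is straightforward with scikit-learn's GroupShuffleSplit; a hypothetical illustration (the data here are made up):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: six images from three patients.
image_ids   = ["img0", "img1", "img2", "img3", "img4", "img5"]
patient_ids = ["p0",   "p0",   "p1",   "p1",   "p2",   "p2"]

# All images of a given patient fall on the same side of the split,
# so no patient "leaks" across the train/test boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(image_ids, groups=patient_ids))
```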

#Numbers in Fig. 1? They are cited from [2].

Reviewer 2 #Is the framework useful for mammograms? Yes; it is not limited to a particular image type and could be applied to any similar task.

#Why choose the current baseline? Our aim is to utilize the relation between the two tasks to improve performance beyond simply training both tasks simultaneously.

#How to calculate AUC_B? Similarly to AUC_P, but the thresholds for AUC_B are the predicted categories (integers from 0 to 5).
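
In other words, the predicted BI-RADS category is treated as an ordinal score, so the ROC curve has at most six threshold points. A toy sketch, assuming the binary reference label is the pathology annotation (an assumption on our part):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: predicted BI-RADS categories as ordinal scores.
birads_pred = [0, 2, 3, 4, 5, 1, 4, 3]
reference   = [0, 0, 0, 1, 1, 0, 1, 1]  # assumed binary pathology label
print(roc_auc_score(reference, birads_pred))  # thresholds fall on the 6 integers
```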

#How does [3] get a better AUC? They might benefit from their huge dataset (>5 million images).

Reviewer 3 #Confusion about equation 3. As you suggested, it would be good to add ‘p_B + p_M = 1’. We found a typo in this equation: it should be log(1-sum(b’(x))) instead of (1-sum(b’(x))), and the loss is then always a positive value.

#How were the images resized? An image is made square by zero-padding the shorter edge, and then resized to 224x224.
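
A minimal sketch of this preprocessing; the symmetric placement of the padding is an assumption:

```python
import numpy as np
import cv2

def pad_and_resize(img, size=224):
    """Zero-pad the shorter edge to make the image square, then resize.

    Preserves the lesion's aspect ratio while unifying the input size.
    """
    h, w = img.shape[:2]
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    padded = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    padded[top:top + h, left:left + w] = img
    return cv2.resize(padded, (size, size))
```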




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors provided sufficient additional information in the rebuttal and addressed most of the issues raised by the reviewers. In particular, information regarding the statistical validation of their proposed algorithm’s performance was provided, which supports their claim regarding the strength of the proposed approach.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2 (Among the best papers)



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Despite some issues with presentation and experimental choices, the work has some novelty and promises added clinical value. The author rebuttal does fill some of the gaps.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a framework for classifying the BI-RADS score (malignancy risk stratification) as well as true (path-proven) malignancy of breast lesions from ultrasound images. The authors addressed the reviewers’ concerns well, including “the computation of the soft labels” and “handling multiple images from the same patient”, and I would suggest accepting this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


