Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Xiangyu Li, Xinjie Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li

Abstract

Neoadjuvant therapy (NAT) for breast cancer is a common treatment option in clinical practice. Tumor cellularity (TC), which represents the percentage of invasive tumor in the tumor bed, has been widely used to quantify the response of breast cancer to NAT. Therefore, automatic TC estimation is significant in clinical practice. However, existing state-of-the-art methods usually take it as a TC score regression problem, which ignores the ambiguity of TC labels caused by subjective assessment or multiple raters. In this paper, to efficiently leverage the label ambiguities, we proposed an Uncertainty-aware Label disTRibution leArning (ULTRA) framework for automatic TC estimation. The proposed ULTRA first converted the single-value TC labels to discrete label distributions, which effectively models the ambiguity among all possible TC labels. Furthermore, the network learned TC label distributions by minimizing the Kullback-Leibler (KL) divergence between the predicted and ground-truth TC label distributions, which better supervised the model to leverage the ambiguity of TC labels. Moreover, the ULTRA mimicked the multi-rater fusion process in clinical practice with a multi-branch feature fusion module to further explore the uncertainties of TC labels. We evaluated the ULTRA on the public BreastPathQ dataset. The experimental results demonstrate that the ULTRA outperformed the regression-based methods by a large margin and achieved state-of-the-art results. The code will be available from https://github.com/PerceptionComputingLab/ULTRA.
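The core conversion described in the abstract — turning a single-value TC label into a discrete label distribution and supervising with KL divergence — can be sketched as follows. This is a hypothetical re-implementation, not the authors' released code; the bin spacing of 0.01 and sigma=0.04 follow the values mentioned in the rebuttal, and the function names are my own.

```python
import numpy as np

def label_to_distribution(tc, n_bins=101, sigma=0.04):
    """Convert a scalar TC label in [0, 1] into a discrete label
    distribution over n_bins evenly spaced candidate scores, using a
    discretized Gaussian centered at the label (an assumption following
    the paper's description and the rebuttal's sigma=0.04, interval 0.01)."""
    bins = np.linspace(0.0, 1.0, n_bins)           # candidate TC scores
    dist = np.exp(-0.5 * ((bins - tc) / sigma) ** 2)  # unnormalized Gaussian
    return bins, dist / dist.sum()                 # normalize to a pmf

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, clipped for stability."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Ground-truth distribution for a TC label of 0.30, and a slightly
# shifted "prediction" to illustrate the loss.
bins, gt = label_to_distribution(0.30)
_, pred = label_to_distribution(0.35)
loss = kl_divergence(gt, pred)                     # > 0 for mismatched dists
mean_pred = float(np.sum(bins * pred))             # distribution mean gives a TC score
```

The distribution mean recovers a scalar TC score, which is how a prediction over bins can still be compared against the single-value label with an MSE term, as the reviews below describe.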

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_29

SharedIt: https://rdcu.be/cVRtf

Link to the code repository

https://github.com/PerceptionComputingLab/ULTRA

Link to the dataset(s)

https://breastpathq.grand-challenge.org/


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a method of predicting tumor cellularity in breast cancer in a way that leverages label uncertainty in the deep learning process. The network optimizes the distance between a predicted distribution and the target distribution (from GT labels), as well as the MSE between the predicted mean and the target TC mean.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The claimed contributions are original. The concept of learning a distribution over labels, along with fusing multiple augmentations to mimic clinical uncertainty, is very powerful.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Learning the distributions can be time-consuming compared to single-value labels.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset is public. Implementation/code is not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This paper presents a novel and innovative method to learn tumor cellularity (TC) using a distribution instead of a deterministic label. The input undergoes multiple augmentations, each of which is fed to a separate network branch, from which the TC score distribution is predicted. The loss is evaluated by comparing the distance (Kullback-Leibler (KL) divergence) between distributions as well as the MSE between the distribution means. The paper includes a detailed description of the methodology. It also shows comprehensive ablation studies and comparisons to other methods. I have some minor comments:

    • How is the standard deviation reflected in the Target Distribution in Figure 2?
    • You have a fully connected network to predict the mean of the TC distribution. Do you need to deploy another network to predict the standard deviation?
    • From Equation 5, what can be interpreted is that you have a fully connected network (MBFF) that directly predicts the bins of the distribution. I feel that there should be some sort of conditional learning between bins in the MBFF. I understand that you used KL divergence to constrain the network learning for this purpose; however, I would like to check whether you explored other methods at this stage, for instance adding a visual attention layer.
    • In the ablation study (Table 1), I would recommend adding more than 3 augmentations to assess the performance of the system with respect to this parameter.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Innovative idea - novel architecture with enhanced results compared to state-of-the-art models.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors propose an uncertainty-aware label distribution learning (ULTRA) framework for tumor cellularity (TC) estimation. In detail, apart from directly regressing the TC score with an MSE loss, the authors model the uncertainty of the TC score using a normal distribution and train a multi-branch DNN to minimize the KL divergence between the output and the normal distribution. The authors validate the proposed method on the public TC estimation dataset BreastPathQ and achieve state-of-the-art results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this work is the novel idea. Considering the uncertainty of the TC score, the authors transform the regression problem into a label distribution learning problem. They use a normal distribution to model the uncertainty of the given TC score and train a multi-branch DNN to fit the normal distribution for TC score prediction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of this paper is the limited novelty of the method. The implementation of ULTRA is very similar to the method proposed by Tang et al. [1]: (1) regarding the regression problem as a label distribution problem; (2) utilizing a normal distribution to model the uncertainty. It feels as if this article has only slightly modified [1] and applied it to the TC score estimation task.

    [1] Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., Zhou, J.: Uncertainty-aware score distribution learning for action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9839-9848 (2020)

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim they will release the code. If so, the study could be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The authors could give more of an introduction to label distribution learning and emphasize the advantages of transferring a regression problem into a label distribution learning problem.
    2. The authors should give more details on the generation of the heatmaps shown in Fig. 3. If possible, please plot the ground-truth heatmaps for comparison.
    3. There should be more innovation in the method.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel idea to transfer TC score regression to a label distribution learning problem, but the novelty of the proposed method is limited.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I see that the differences between the proposed method and Tang et al. [1] lie in the MBFF and the multi-task learning scheme, and the idea of using label distribution learning for TC assessment is interesting. But I cannot agree with “Tang et al. can only model multiple raters when multiple scores from different raters are available.” In fact, the USDL proposed by Tang et al. [1] can model scoring variability with a single GT score.

    [1] Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., Zhou, J.: Uncertainty-aware score distribution learning for action quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9839-9848 (2020)



Review #4

  • Please describe the contribution of the paper

    The goal of the paper is to present a novel method for estimating tumor cellularity (TC) in breast cancer on histopathology images. TC assessment by experts suffers from variability, and quantifying uncertainty is key for building better evaluation tools. To this end, the regression problem on the TC value is translated into TC distribution learning, and the multi-rater process is reproduced by a multi-branch fusion module (each branch is fed an augmented version of the input data). The TC score is still included as an additional loss term. The public BreastPathQ dataset was used for evaluating the approaches, and results are improved compared to SOTA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • interesting clinical problem
    • simplicity of the method
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the statistical analysis is rather weak for evaluating generalization
    • important parameters (sharpness of the Gaussian distribution, importance of branches, relative weight of the KL and MSE losses) are set empirically, and their influence on the outcome is not discussed or investigated.
    • the rationale for multiple branches is rather weak: the augmented versions do not really reproduce the kind of variability that would change the prediction of an expert.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The study is based on a public dataset and code will be provided. More detail could be given in the paper (e.g., the exact augmentation strategy and the exact architecture of the MLP and FC layers).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper is interesting and well written. Yet the approach should be better justified to be completely convincing. The results are satisfactory, but the rather weak statistical validation probably makes some comparisons not statistically significant.

    Uncertainty is considered by replacing labels with distributions, but the sharpness of this distribution is set empirically (how?). More medical insight (e.g., a prior estimation of inter-operator variability) would be very interesting. Using augmented versions of the input data to reproduce variability could work in principle, but here the augmentation transforms (horizontal and vertical flips and elastic transforms) are probably not the ones that would dramatically change an expert's assessment. Perhaps more transforms (e.g., on contrast) could be considered, or samples with high TC discordance analyzed. Many hyper-parameters are set empirically, as is the procedure for predicting the TC score from the regression and label distribution branches. These should have been set, e.g., with cross-validated selection on the training set and then evaluated on the validation set.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the idea and its clinical importance are interesting, there are too many weaknesses in the paper for it to be suitable for publication. In particular, the statistical analysis should be improved, and the importance of the empirically set hyper-parameters should be investigated.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper under consideration introduces a framework for automatic TC estimation in breast cancer, so-called ULTRA, which has been tested on the public BreastPathQ dataset. The main strengths of the work are 1) the modeling of TC score uncertainty, and 2) the use of a multi-branch fusion module trained by minimizing the KL divergence between the predicted and ground-truth TC label distributions. The idea is interesting for an important application, and the experimental validation is sufficient (multiple ablation studies). The results showed superior performance to regression-based methods. However, some points need to be accounted for, which are listed below as well as by the reviewers. The authors need to provide a compelling argument regarding the similarities/differences between the present work and Tang et al. [1], as raised by R3. Another point that should be addressed is the statistical analysis for the results (Tables 1, 2, and 3). For the multi-branch networks, is N=3 the maximum? The rationale behind setting \sigma=0.04 empirically should be well justified; why was it not optimized experimentally? The same applies to \alpha in Eq. (7). The authors need to justify how the parameters of the pipeline were chosen and discuss the effect of these choices on performance; ad-hoc choices can provide good results but not the optimum. Finally, the authors should add details on how Fig. 3 was generated and include the ground-truth counterpart; the indicated numbers are hard to relate to what they refer to.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

We thank all reviewers for the positive comments that the proposed method is ‘very powerful,’ an ‘innovative idea,’ a ‘novel idea,’ that its ‘contributions are original,’ that the ‘clarity and organization of this paper is excellent,’ and for noting the ‘comprehensive ablation studies,’ ‘SOTA results,’ ‘interesting and well-written’ presentation, and ‘interesting clinical problem.’

R2 strongly accepted this paper; some minor comments are clarified as follows: 1) We use a simple yet effective way to illustrate the standard deviation: it is reflected by the degree of dispersion of the TC distributions in Figure 2. 2) Thanks for the insightful suggestions; we had investigated most of them with extensive experiments (e.g., more augmentations with N>3). We did not add them to the paper since we did not see apparent improvements in the experiments.

R3: Compared with Tang et al., the main advantages are as follows: 1) A more flexible multi-rater learning strategy: our method makes it possible to model inter-rater variability with a single TC value by modeling the inter-rater variability with multiple augmentations in different branches, which is more general and flexible, whereas Tang et al. can only model multiple raters when multiple scores from different raters are available. 2) Better convergence with multi-task learning: the proposed method achieved superior results with better convergence via multi-task learning (combined distribution mean regression and label distribution learning), which significantly alleviated the optimization difficulty and objective mismatch issues. 3) A brand-new application: our work is the first to model such ambiguities with label distribution learning in clinical practice, which is of great clinical significance.

R3: Our heatmap generation strategy is the best available way to visualize TC estimation in the WSI, given that only some patch labels are available. We generated heatmaps on WSIs by performing patch-wise TC estimation in a sliding-window fashion.

R4: We reported 95% confidence intervals (CI) (Table 3), which allow detailed statistical analysis of the experimental results, because a CI provides a plausible range for the true value related to the point estimate and other essential information about statistical significance. We also performed t-tests between our method and the ablations and SOTA methods. We obtained p<0.001 on ICC, p<0.005 on Kappa, and p<0.01 on MSE for these comparisons, which indicates that the superiority of our method is statistically significant.

R4: All important hyper-parameters were determined by a large number of experiments. For the sharpness of the Gaussian distribution, as Gao [18] indicates, ‘In label distribution learning, setting σ close to the interval between neighboring labels is a good choice.’ In our case, the interval is equal to 0.01. We followed Gao [18] and tested σ from 0 to 6σ with a 0.5σ interval, and the experiments demonstrated that 4σ=0.04 achieved the best results. The branch weights (Wk) and loss weight (α) were validated by grid search. Wk=1 is reasonable since no priors indicate the raters' relative importance. α=1 denotes that both the regression and distribution learning branches are important for our method. Moreover, more experimental results will be available on GitHub.

R4: The proposed augmentations can reproduce inter-rater variability and achieve the best results, because they significantly change the visual pattern of the tumor bed in the microscope field, which leads to high annotation variability across experts. Thanks for R4's constructive suggestions; as R4 stated, ‘more transforms could be considered.’ In fact, we had tried more transforms in our experiments, and we adopted horizontal and vertical flips and elastic transforms because this combination achieved the best results. Moreover, the proposed framework is very flexible and can incorporate more transforms to model inter-rater variability.
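The patch-wise, sliding-window heatmap generation described in the rebuttal can be sketched as below. This is an illustrative reconstruction, not the authors' code: the window size, stride, and averaging of overlapping windows are assumptions, and `predict_tc` stands in for any trained patch-level TC predictor.

```python
import numpy as np

def tc_heatmap(wsi, predict_tc, patch=256, stride=128):
    """Build a per-pixel TC heatmap over a whole-slide image (H x W x 3
    array) by scoring patches in a sliding-window fashion and averaging
    overlapping windows. `predict_tc` maps a patch to a TC score in [0, 1]."""
    h, w = wsi.shape[:2]
    acc = np.zeros((h, w))   # summed scores per pixel
    cnt = np.zeros((h, w))   # number of windows covering each pixel
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            score = predict_tc(wsi[y:y + patch, x:x + patch])
            acc[y:y + patch, x:x + patch] += score
            cnt[y:y + patch, x:x + patch] += 1
    # Average overlaps; pixels never covered by a window stay 0.
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)

# Example: a constant predictor over a blank 512x512 slide yields a
# uniform heatmap wherever the windows reach.
heatmap = tc_heatmap(np.zeros((512, 512, 3)), lambda p: 0.5)
```

In practice a real predictor would be the trained ULTRA model (e.g., the mean of its predicted TC distribution per patch), and tissue-free regions would typically be masked out before scoring.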




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal cleared up most of the previous comments. I agree with R3 that the contribution lies in the multi-branch feature fusion module and the multi-task learning scheme. The proposed scheme, like Tang et al. [1], possesses the ability to model scoring variability with a single GT score. Thus, the similarities/differences between the present system and Tang et al. [1] need to be emphasized in the final version. The same applies to the statistics provided in the rebuttal.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes an uncertainty-aware label distribution learning (ULTRA) framework for tumor cellularity (TC) estimation. In the rebuttal, the authors address the major issues raised in the reviewers' comments, including the statistical analysis and the hyper-parameters. Therefore, I suggest accepting this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper converts the single-valued breast tumor cellularity label into a distribution to leverage label uncertainty in the learning process, which is an interesting idea with demonstrated performance improvement. Although in general I think this paper is worth publishing to the MICCAI audience, I would also like to point out that I somewhat share Reviewer #4's concern about model generalization. As a key parameter governing a Gaussian distribution, the hyper-parameter \sigma could play an important role in the success of the proposed method. Although the authors provided a rule of thumb in the rebuttal for determining the value of \sigma, there is still no ablation study on the sensitivity of the proposed model to \sigma, which, in my mind, is not trivial. I strongly suggest the authors consider adding this experiment if their paper is finally accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6


