
Authors

Yang Hu, Korsuk Sirinukunwattana, Kezia Gaitskell, Ruby Wood, Clare Verrill, Jens Rittscher

Abstract

Previous efforts to learn histology features that correlate with specific genetic/molecular traits resort to tile-level multi-instance learning (MIL), which relies on a fixed pretrained model for feature extraction and an instance-bag classifier. We argue that such a two-step approach is suboptimal for capturing both the fine-grained features at the tile level and the global features at the slide level that are relevant to the task. We propose a self-interactive MIL that iteratively feeds back training information between the fine-grained and global context features. We validate the proposed approach on four subtyping tasks: EMT status (ovarian), KRAS mutation (colon and lung), EGFR mutation (lung), and HER2 status (breast). Our approach yields an average improvement of 7.05%–8.34% (in terms of AUC) over the baseline.
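
As a rough illustration of the alternating scheme described above, the following is a minimal sketch assuming a PyTorch-style tile encoder and an attention-pooling aggregator; `encoder`, `att_pool`, the tile-level head `encoder.classify`, and the top-k selection rule are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_round(encoder, att_pool, slides, slide_labels, k, opt_enc, opt_pool):
    """One round of an alternating (self-interactive) MIL scheme (sketch only)."""
    # Step 1: slide-level pass -- freeze the encoder, train the attention pooling.
    encoder.eval()
    for tiles, label in zip(slides, slide_labels):
        with torch.no_grad():
            feats = encoder(tiles)                      # [n_tiles, d]
        slide_logit, att = att_pool(feats)              # slide logit + per-tile attention
        loss = F.binary_cross_entropy_with_logits(slide_logit.view(-1), label.view(-1))
        opt_pool.zero_grad(); loss.backward(); opt_pool.step()

    # Step 2: tile-level pass -- fine-tune the encoder on high-attention tiles,
    # treating the slide label as a (noisy) tile-level label.
    encoder.train()
    for tiles, label in zip(slides, slide_labels):
        with torch.no_grad():
            _, att = att_pool(encoder(tiles))
        top_idx = att.view(-1).topk(min(k, tiles.shape[0])).indices
        tile_logits = encoder.classify(tiles[top_idx])  # assumed tile-level head
        tile_labels = label.view(1).expand(top_idx.numel()).float()
        loss = F.binary_cross_entropy_with_logits(tile_logits.view(-1), tile_labels)
        opt_enc.zero_grad(); loss.backward(); opt_enc.step()
```

Repeating the two passes lets the refreshed tile embeddings and the slide-level attention weights inform each other across rounds, which is the feedback loop the abstract refers to.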

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_13

SharedIt: https://rdcu.be/cVRrc

Link to the code repository

https://github.com/superhy/LCSB-MIL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In order to bridge the divide of usually two separate steps, i.e. tile embedding and feature integration, the paper proposes an alternative optimization method to fine-tune the CNN encoder and to learn attention pooling. The CNN encoder is fine-tuned on three sets of tiles from the decomposition of WSI tiles into (1) attention, (2) supplementary, and (3) negative tiles with respect to their attention scores.
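
    A minimal sketch of the three-way tile decomposition summarized above, assuming per-tile attention scores are already available; the cut-off sizes k1, k2, n and the ranking rule are illustrative only, not the authors' exact selection procedure.

```python
import torch

def split_tiles_by_attention(att_scores: torch.Tensor, k1: int, k2: int, n: int):
    """Partition tile indices by attention score (illustrative sketch).

    Returns the k1 top-attention tiles, the next k2 'supplementary' tiles,
    and the n lowest-attention 'negative' tiles.
    """
    order = torch.argsort(att_scores.view(-1), descending=True)
    attention_idx = order[:k1]
    supplementary_idx = order[k1:k1 + k2]
    negative_idx = order[-n:]
    return attention_idx, supplementary_idx, negative_idx
```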

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method achieves the best performance (AUC score) on all four benchmarking datasets; the paper is well-written and the analysis is performed thoroughly.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    One weakness of the paper is that the proposed method mainly focuses on attention pooling/integration from tiles, and thus restricts itself to binary classification tasks. Binary classification is a relatively easy task compared with instance segmentation. While instance (e.g., tumor) detection and segmentation is a crucial step towards WSI classification, this method may not be usable to improve instance segmentation for WSIs. Again, in Table 1, only the AUC score is reported for the classification task. What is the accuracy of each method on each dataset? Fig. 2 is confusing. What does “subtype and non-subtype” mean in this figure? Should it be, e.g., tumor and non-tumor?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    On page 6, it states that “more details are available in the (temporarily anonymous) source code”, but its URL is missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    On page 3: “…we use k representative tiles with high attention scores to fine-tune CNN encoder f_res…”: it is not sufficiently clear how this f_res is fine-tuned, given that it is pre-trained on ImageNet. It would be better to also give a brief description here. On page 4: k^1 and k^2 would be better written as k_1 and k_2.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    An interesting paper, but it contains some flaws.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    The authors propose a self-interactive multi-instance learning framework for predicting molecular traits. Specifically, the backbone and aggregation network are optimized alternately for fine-grained and global features, respectively, and an instance selection strategy and adversarial optimisation are further proposed. The authors validate their method on multiple genetic and molecular analyses and achieve promising results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The motivation of this paper is clear and reasonable.
    2. The paper is well-organized, and the method is clearly introduced and demonstrated by Figure 2.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of ablation experiments demonstrating the effectiveness of the three kinds of selected instances.
    2. Lack of experiments on hyper-parameter sensitivity, e.g., the numbers of attention tiles, supplementary tiles, and low-attention tiles.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • One question about the dataset: why does splitting the data into training and test sets alleviate the small sample size problem?
    • More experiments: (a) Quantitative ablation experiments investigating the different parts of the selected instances (i.e., the attention tiles, supplementary tiles, and negative tiles) are necessary. (b) Analysis of hyper-parameter sensitivity, e.g., the (defined/sampled) numbers of the three kinds of tiles, the weight of the adversarial training, and the choice of L_final/L_init. (c) Since the selected instances are further used to fine-tune the backbone+fc, one could directly use the output of the fc for instance selection, which seems more reasonable than attention, as attention is unconstrained.
    • Visualization: (a) t-SNE: instead of increasing the number of top tiles, it would be helpful to visualize the lowest-attention ones, as this is the difference between Inter and adInter training. (b) As the attention scores are generated after a softmax, it would be helpful to clarify how the blue-green-yellow-red/blue-white-red colour maps are defined.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is well-motivated and the experimental results are convincing.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors of this paper present a method for the prediction of molecular traits or biological types from WSIs. In particular, they present a multi-stage method that iteratively fine-tunes a WSI-tile attention module and a feature extraction module. The authors present results on four classification endpoints and report better AUC scores than a number of competing methodologies from the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written and easy to follow, with a clear presentation of results.
    • I find the multi-stage framework for both feature and endpoint learning quite interesting and well motivated considering the current literature on WSI classification.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There are some weaknesses in the experimental configuration that need clarification (refer to details comments).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Publicly available datasets.
    • Code release promise.
    • Clear description of the utilized hyper-parameters.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I am concerned about the statement: “The training and test sets are generated using bootstrapping for 10 folds …”; to my understanding, this scheme would let the same samples fall into both the training and test sets, since bootstrapping performs sampling with replacement. In that case, there would be some bias in the presented results.
    • The authors state that they did not use a validation set to select their models and instead simply use the last snapshot of the model for testing. This is quite tricky, since in fact there is no certainty that the different models converge in a similar manner, or even that their convergence is stable. Hence, it could be the case that a competing method would reach performance similar to the proposed one with a proper train/val scheme.
    • I believe that AUC can be quite cryptic in terms of classification performance, considering also that it is quite low in some cases and that there is the extra concern of data leakage into the test set. I would suggest that the authors complement it with additional metrics, such as BACC, F1, sensitivity, and specificity, as well as the ROC curves.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is mostly based on some weak points in the experimental configuration. Even though the overall experimental structure may be sufficient, there are a number of points that need clarification before acceptance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a self-interactive MIL that iteratively feeds back training information between the fine-grained and global context features, which is rather interesting and well-motivated for WSI analysis. Reviewers generally praise the paper (well-written paper & interesting multi-stage framework), yet have concerns about the experimental configuration (R3), hyper-parameter sensitivity (R2), and ablation studies (R2). The authors should address these issues in the rebuttal. Additionally, the authors are encouraged to release the source code to increase reproducibility.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We would like to thank all the reviewers and the AC for their constructive comments. All the reviewers unanimously agreed upon the merits of the paper: 1) the proposed self-interactive MIL method (Inter-MIL) is novel, clearly motivated, and technically sound, 2) extensive experimental results are presented, and 3) the paper is well-written. Specific concerns regarding the experimental configuration (R3), hyperparameter sensitivity (R2), and ablation studies (R2) are addressed below.

Major comments:

1) Train/test sets (R3): ‘Bootstrapping’ here means randomly partitioning the data several times with no replacement. Each time, we randomly select 70% of all samples for training and use the remaining 30% for testing. Thus, the train and test sets do not overlap. The proportion of positive/negative classes remains the same in both the train and test sets.

2) Training stopping point (R3): Without a validation set, we use L_final (the training loss threshold) to decide the stopping point. We picked L_final based on the optimal performance of the baseline model (Gated-AttPool) rather than Inter-MIL (ours). We also tested and found that this L_final is optimal for other competing methods such as CLAM and FocAtt. While the selected L_final may be suboptimal for Inter-MIL, Inter-MIL still consistently performs better than all competing methods.

3) Ablation study (R2): We conducted an ablation study to assess the effect of the different types of attention tiles on model performance: A) only attention tiles, B) both attention and supplementary tiles, and C) attention, supplementary, and negative tiles. We reported B and C in the original submission. The AUC results of A are as follows: OV-EMT (68.53±0.21); COLU-KRAS (64.01±0.3); LU-EGFR (67.99±0.02); BR-HER2 (62.00±0.07). These results will be added to the revised paper.

4) Hyper-parameter sensitivity (R2): We tested the hyperparameter sensitivity of the AUC on the OV-EMT task. The following parameters were considered: k1, k2, n, and the weight of the adversarial training. For Inter-MIL, we tested different values of k1 and k2: fixing k1=50, setting k2=50 gives AUC=67.31±1.99 and k2=10 gives AUC=70.11±0.44; fixing k2=20, setting k1=20 gives AUC=67.34±1.88 and k1=100 gives AUC=69.05±1.00. For adInter-MIL, we varied n: fixing k1=50 and k2=20, setting n=20 gives AUC=66.20±0.55 and n=5 gives AUC=72.22±0.32. For adInter-MIL, we also tested different adversarial training weights (gamma^neg): fixing k1=50, k2=20, and n=10, setting gamma^neg=10^-3 gives AUC=64.69±0.29 and gamma^neg=10^-5 gives AUC=71.78±0.79. These results suggest that the model is less sensitive to k1, k2, and n, while a large gamma^neg can negatively affect performance. The setting of L_final is described in Answer 2), and we define L_init = L_final + 0.15.

5) Scope of Inter-MIL (R1): Inter-MIL can easily be extended to multi-class classification by changing the number of outputs of the final classification layer and using a multi-class cross-entropy loss. The attention tiles, supplementary tiles, and adversarial tiles can then be selected in a one-vs-rest fashion. Instance segmentation is beyond the scope of this work. We believe the technique can be applied in other settings and that it will be a valuable contribution to the community.

Minor concerns:

6) Reporting of accuracy (R1, R3): We omitted the balanced accuracy (BACC) in the paper due to space limitations. Here we highlight some BACC results. On OV-EMT: Gated-AttPool (70.45±1.87); CLAM (69.00±1.08); Inter-MIL (ours) (84.86±0.45); adInter-MIL (ours) (85.45±0.48). On COLU-KRAS: Gated-AttPool (56.77±0.51); CLAM (60.69±0.73); Inter-MIL (ours) (64.50±0.45); adInter-MIL (ours) (65.85±0.24). Inter-MIL and adInter-MIL consistently perform better than the other competing methods. We will report the full BACC results in the revised paper.

7) Source code (R1): The URL for the source code is anonymized following MICCAI’s double-blind reviewing guideline.

8) The expression ‘To alleviate the sample size problem’ on page 5 is misleading and should be changed to ‘To ensure enough training samples’.
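
For concreteness, the following is a minimal sketch (not the authors' code) of the repeated, stratified 70/30 partitioning described in point 1, using scikit-learn's StratifiedShuffleSplit; `slide_ids` and `slide_labels` are toy placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy placeholders for slide identifiers and binary slide labels.
slide_ids = np.arange(100)
slide_labels = np.random.randint(0, 2, size=100)

# 10 repeated, stratified 70/30 partitions; within each fold the train and
# test sets are disjoint, and class proportions are preserved on both sides.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
for fold, (train_idx, test_idx) in enumerate(splitter.split(slide_ids, slide_labels)):
    assert set(train_idx).isdisjoint(test_idx)   # no leakage within a fold
    # ... train on slide_ids[train_idx], evaluate on slide_ids[test_idx]
```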




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors address most of the concerns in the rebuttal and also promise to release the code. I recommend accepting the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2 to 3



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a self-interactive MIL framework with multiple fine-tuning strategies for predicting molecular traits. The paper is well-written and the motivation is sound. The rebuttal has provided more analysis and discussion, which should be included in the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a self-interactive multi-instance learning method. The recommendation from the reviewers is in general positive. I think the rebuttal has addressed the concerns about the rigor of experimental design and details about the methodology. I think the proposed method would have a broader impact for pathological image analysis. For these reasons, the recommendation is toward acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


