
Authors

Linhao Qu, Xiaoyuan Luo, Shaolei Liu, Manning Wang, Zhijian Song

Abstract

Multiple Instance Learning (MIL) is widely used in analyzing histopathological Whole Slide Images (WSIs). However, existing MIL methods do not explicitly model the data distribution; instead they only learn a bag-level or instance-level decision boundary discriminatively by training a classifier. In this paper, we propose DGMIL: a feature distribution guided deep MIL framework for WSI classification and positive patch localization. Instead of designing complex discriminative network architectures, we reveal that the inherent feature distribution of histopathological image data can serve as a very effective guide for instance classification. We propose a cluster-conditioned feature distribution modeling method and a pseudo label-based iterative feature space refinement strategy so that in the final feature space the positive and negative instances can be easily separated. Experiments on the CAMELYON16 dataset show that our method achieves a new SOTA for both global classification and positive patch localization tasks.
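
The loop the abstract describes can be summarized in a few lines. The following is a minimal NumPy/scikit-learn sketch, not the authors' implementation; the cluster count, feature dimension, and the 5%/95% pseudo-label quantiles are illustrative assumptions.

```python
# Sketch of a DGMIL-style iteration: model the negative-instance feature
# distribution with cluster-conditioned Gaussians, score every instance by
# Mahalanobis distance to the nearest negative cluster, pseudo-label the
# extremes, then refine the feature space and repeat.
import numpy as np
from sklearn.cluster import KMeans

def cluster_gaussians(neg_feats, n_clusters=5):
    """Fit one Gaussian (mean, inverse covariance) per cluster of negative features."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(neg_feats)
    stats = []
    for k in range(n_clusters):
        x = neg_feats[km.labels_ == k]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])  # regularized
        stats.append((mu, np.linalg.inv(cov)))
    return stats

def positivity_score(feats, stats):
    """Score each instance by Mahalanobis distance to its nearest negative cluster."""
    d = np.stack([np.sqrt(np.einsum('ni,ij,nj->n', feats - mu, icov, feats - mu))
                  for mu, icov in stats])
    return d.min(axis=0)

# Toy usage: 4096 instances with 32-d features; the first 1000 come from negative slides.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4096, 32))
scores = positivity_score(feats, cluster_gaussians(feats[:1000]))
pseudo_pos = scores >= np.quantile(scores, 0.95)  # extreme instances get pseudo labels;
pseudo_neg = scores <= np.quantile(scores, 0.05)  # a classifier head is then trained on
                                                  # them and the procedure is iterated
```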

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_3

SharedIt: https://rdcu.be/cVRq0

Link to the code repository

https://github.com/miccaiif/DGMIL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a feature distribution guided deep MIL framework for WSI classification and positive patch localization. Specifically, the authors propose a cluster-conditioned feature distribution modeling method and a pseudo label-based iterative feature space refinement strategy. The framework achieves a new SOTA on the CAMELYON16 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper models the multi-instance problem from the perspective of the data distribution of pathological slides, which removes the need to design complex discriminative networks.
    2. This paper is well-written and logically clear.
    3. This paper achieves a new SOTA on the CAMELYON16 dataset, verifying the effectiveness and superiority of the method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This paper only uses AUC for evaluation on one dataset (CAMELYON16) and does not visualize slides with high positive scores.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    In the reproducibility checklist, the authors state that the code will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. This paper analyzes the binary classification problem as an example. For a multi-class problem, how should the negative slides be chosen for clustering at the beginning of each iteration?
    2. The authors divided each WSI into 512x512 patches without overlap at 5x magnification. Is the performance of DGMIL affected at other magnifications? Does the clustering step take up a lot of training time if patches are extracted at high magnification?
    3. Why are the AUC results in Table 1(a) different from those published in the DSMIL article?
    4. How is a patch determined to be positive or negative when calculating the patch AUC metric? What threshold on the positive score is used to decide whether a patch is positive or negative?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper uses multi-instance learning to solve the problem of pathological image classification from the perspective of data distribution, which is a novel angle. However, this paper only conducts experiments on one dataset and lacks visualization results.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper focuses on improving the learning of the feature extractor in multiple instance learning for WSI classification. The authors initialize the feature extractor via self-supervised learning (i.e., MAE) and perform clustering, instance selection, and refinement iteratively. Validations and ablation studies on Camelyon16 show its effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-organized and easy to follow. The motivation and method are clearly demonstrated.
    2. The experiments are fairly complete. The proposed method achieves a performance gain on Camelyon16.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of explanation for some settings: e.g., the reason for using Mahalanobis distance instead of alternatives like cosine similarity, the reason for using features after the projection head instead of backbone features, and the definition of refinement convergence.
    • The reported results in Table 1(a) are significantly lower than those in DSMIL; e.g., the Slide AUC of Ab-MIL is 0.6612 in this paper but 0.8653 in DSMIL. I notice that the settings of patch size (512 vs. 224) and magnification (5x vs. 20x) are different. It would be more convincing to run the proposed method in the same settings as DSMIL instead of reproducing other works, to avoid unfair comparison.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Seems good, as almost all items are marked [YES].

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • More explanation of some settings: (a) It would be more convincing to either illustrate the advantage of using Mahalanobis distance or show that the proposed method is agnostic to the choice of distance. (b) The authors are encouraged to conduct an experiment using the backbone features for refinement. (c) It would help to detail the refinement training settings, e.g., epochs, learning rate, etc., and especially the definition of refinement convergence.
    • More explanation of the performance gap: the reported results in Table 1(a) are significantly lower than those in DSMIL; e.g., the Slide AUC of Ab-MIL is 0.6612 in this paper but 0.8653 in DSMIL. The authors use different settings of patch size (512 vs. 224) and magnification (5x vs. 20x). It would be more convincing to run the proposed method in the same settings as DSMIL instead of reproducing other works, to avoid unfair comparison.
    • As pointed out in the introduction, many key-instance-based methods may assign wrong pseudo labels or select too few instances. The authors could further validate that the proposed method solves this problem with a more detailed ablation study or visualization.
    • Reference mistakes: (a) The references for Loss-based-MIL and Chikontwe-MIL are incorrect. (b) Missing discussion of [1,2]. [1] Xu, Yan, et al. "Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering." CVPR, 2012. [2] Sharma, Yash, et al. "Cluster-to-conquer: A framework for end-to-end multi-instance learning for whole slide image classification." Medical Imaging with Deep Learning. PMLR, 2021.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-motivated and easy to follow, but it lacks details and explanations. In particular, the reported results are somewhat unconvincing, as they are much lower than those in previous work.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    This paper proposes a distribution guided multiple instance learning framework for whole slide image classification. The proposed method refines the instance representation in the latent space by using a cluster-conditioned feature distribution modeling method and a pseudo label-based iterative feature space refinement strategy. It outperforms several state-of-the-art methods on the CAMELYON16 dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The proposed cluster-conditioned feature distribution modeling method and pseudo label-based iterative feature space refinement strategy are interesting and can benefit the community in better utilizing MIL for medical images.
    • The paper is well organized and easy to follow.
    • The performance improvement over the state of the art on the CAMELYON16 dataset is significant.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -About the proposed method. The rigid MIL formulation requires permutation invariance of the MIL aggregation function (i.e., the mapping from the instance space to the bag space). However, the authors do not discuss this, and whether the cluster-conditioned feature distribution modeling method and the pseudo label-based iterative feature space refinement strategy can guarantee permutation invariance is unclear in the current form of the manuscript.

    -The authors missed several recent works on deep MIL and its applications in medical images. For example, deep MIL: [1] A multiple-instance densely-connected ConvNet for aerial scene classification. IEEE Transactions on Image Processing 29, 4911-4926 (2020)

    Deep MIL and its applications in medical images: [2] Multi-instance multi-scale CNN for medical image classification. International Conference on Medical Image Computing and Computer Assisted Intervention (2019). [3] Local-global dual perception based deep multiple instance learning for retinal disease classification. International Conference on Medical Image Computing and Computer Assisted Intervention (2021).

    The authors are encouraged to enrich the related work and, if necessary, add more comparisons.

    -More validation on additional datasets is highly preferred to fully demonstrate the effectiveness of the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset is publicly available. The proposed method is described clearly enough to implement. However, I cannot confirm its reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness part in Sec. 5.
    • Justification of the permutation invariance in the latent space is necessary.
    • More related works need to be covered and, if necessary, compared.
    • Validation on more datasets is highly preferred.
    • Also, some typos need to be corrected before publication. For example, in Eq. 1, what is 'iff'?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Both strengths and weaknesses are quite obvious, and this manuscript is clearly a borderline paper. Currently I am willing to rate it a weak accept due to its interesting idea and significant performance gain, which can benefit the community.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    All of my concerns are addressed. I now lean more toward accepting this work. I still recommend weak accept. I hope the authors revise the paper thoroughly before the camera-ready deadline.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a weakly-supervised learning method for WSI-level and patch-level classification using the CAMELYON16 dataset. The method consists of self-supervised feature extraction, clustering, and pseudo-label based feature space refinement. Evaluation shows improved performance over existing methods. While all reviewers are generally positive about the paper, there are some issues, particularly related to evaluation. For instance, typically in MICCAI papers, more than one dataset would be used for evaluation. The use of 5x patches is not that standard. Evaluation using only AUC is not sufficient; typically, for patch-level classification on CAMELYON16, FROC is used. There are many other questions as well. The authors should prepare a good rebuttal to address these questions.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We sincerely thank the AC and all reviewers for the valuable comments. We first reply to questions raised by the AC and multiple reviewers, and then to the remaining questions from each reviewer.

(Q1) For only AUC metric on one dataset. (Meta, Reviewer1, Reviewer3) (A) 1) We calculated the Slide Accuracy metric and the patch FROC metric on the 5x CAMELYON16 dataset, and both results are higher than those of DSMIL: Slide Accuracy=0.8018 (DSMIL: 0.7359), FROC=0.4887 (DSMIL: 0.4560). 2) We added an experiment on the 20x TCGA Lung Cancer dataset under the same experimental settings as DSMIL. Since this dataset does not have patch-level labels, we only evaluated slide classification performance. Our method achieved AUC=0.9702 and Accuracy=0.9200 (DSMIL: AUC=0.9633, Accuracy=0.9190). These results will be added to the final version.

(Q2) For 5x resolution. (Meta, Reviewer1, Reviewer2) (A) 5x, 10x, 20x, and 40x are all common magnifications in pathology image processing. For computational efficiency and resource considerations, we used 5x (vs. DSMIL's 20x) in our experiments. We used a patch size of 512 (vs. DSMIL's 224), and a patch is labeled positive if it contains 25% or more cancer area (not specified in DSMIL). These different settings account for the difference between our reported metrics and those reported by DSMIL. We will open source the dataset and code to ensure reproducibility. On an i7-11700K CPU, one clustering pass takes about 12 minutes for the 20x TCGA dataset (~3 million patches) and about 2 minutes for the 5x CAMELYON16 dataset (~150,000 patches).

(Q3) For mistakes and missing related works. (Reviewer2, Reviewer3) (A) Thank you very much for pointing out the mistakes; all of them will be fixed. We compared against SOTA methods in the field of pathology MIL, and more related works will be discussed.

Reviewer1: (Q4) For the multi-classification problem. (A) Most existing pathology MIL problems are binary classification. For a multi-class problem, one way is to transform it into multiple consecutive binary classification problems.
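
One hedged reading of "multiple consecutive binary classification problems" is a one-vs-rest decomposition. The wrapper below is a hypothetical sketch, not part of the paper; the binary trainer `train_binary_dgmil` is assumed.

```python
# Hypothetical one-vs-rest wrapper: reduce a C-class slide problem to C binary
# DGMIL runs. For class c, slides of class c are the positive bags and all
# remaining slides serve as the negative bags whose instances are clustered at
# the start of each refinement iteration.
def multiclass_via_one_vs_rest(slides, labels, classes, train_binary_dgmil):
    models = {}
    for c in classes:
        binary_labels = [1 if y == c else 0 for y in labels]
        models[c] = train_binary_dgmil(slides, binary_labels)  # assumed binary trainer
    return models  # at test time, predict the class whose model scores highest
```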

(Q5) For threshold determining positive/negative patches. (A) We determine the optimal threshold using the validation set and then use that threshold to decide whether the test patches are positive or negative.
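
The rebuttal does not state how the optimal threshold is selected. One common choice, an assumption here rather than the authors' stated procedure, is to maximize Youden's J statistic on the validation ROC curve:

```python
# Pick the validation-set threshold maximizing Youden's J = TPR - FPR.
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(val_labels, val_scores):
    fpr, tpr, thresholds = roc_curve(val_labels, val_scores)
    return thresholds[np.argmax(tpr - fpr)]

# At test time, a patch is called positive when its score >= the chosen threshold.
```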

(Q6) Does not visualize slides with high positive scores. (A) Visualization will be added to supplementary materials.

Reviewer2: (Q7) For Mahalanobis distance. (A) Mahalanobis distance is widely used to measure the distance between a point and a distribution. It utilizes the mean and covariance of the data, overcoming the scale and correlation problems inherent in Euclidean distance. Cosine similarity only measures the angle between vectors and is seldom used to measure the distance between a point and a distribution.
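
A toy illustration of this point, using synthetic anisotropic features rather than data from the paper: Euclidean distance is dominated by the high-variance axis, while Mahalanobis distance whitens scale and correlation and flags the true outlier.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
cluster = rng.normal(size=(1000, 2)) * np.array([10.0, 0.1])  # anisotropic "negative" cluster
mu = cluster.mean(axis=0)
icov = np.linalg.inv(np.cov(cluster, rowvar=False))

a, b = np.array([10.0, 0.0]), np.array([0.0, 1.0])
print(np.linalg.norm(a - mu), np.linalg.norm(b - mu))        # Euclidean: ~10 vs ~1
print(mahalanobis(a, mu, icov), mahalanobis(b, mu, icov))    # Mahalanobis: ~1 vs ~10
# Euclidean calls `a` the outlier; Mahalanobis correctly flags `b`,
# which lies ~10 standard deviations along the low-variance axis.
```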

(Q8) For using the backbone features for refinement. (A) Adding a simple Linear Projection Head has proven to be very efficient in feature learning [1]. Directly fine-tuning the MAE encoder would require updating a large number of parameters.

[1] Chen T. et al. A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607. PMLR (2020)
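
A minimal PyTorch sketch of the head design described here and in (Q9), assuming precomputed MAE features; the feature and projection dimensions are illustrative, not from the paper.

```python
import torch.nn as nn

class RefinementHead(nn.Module):
    """One-layer projection + one-layer classifier on top of frozen MAE features."""
    def __init__(self, feat_dim=768, proj_dim=128):   # dims are assumptions
        super().__init__()
        self.project = nn.Linear(feat_dim, proj_dim)  # Linear Projection Head
        self.classify = nn.Linear(proj_dim, 2)        # Classification Head

    def forward(self, feats):                         # feats: precomputed, encoder stays frozen
        z = self.project(feats)                       # refined feature space
        return z, self.classify(z)
```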

(Q9) For refinement settings. (A) For feature refinement, we use the Adam optimizer with an initial learning rate of 0.01 and per-epoch cosine decay. Both the Linear Projection Head and the Classification Head consist of one fully connected layer. Refinement convergence means that the decrease of the cross-entropy loss stays below a small threshold for 10 consecutive epochs. These settings will be added to the final version.
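
Putting the stated hyperparameters together, a hedged PyTorch sketch of the refinement loop; the maximum epoch count, the tolerance value, and the data loader are assumptions, and `model` is e.g. the RefinementHead sketched above.

```python
import torch
import torch.nn.functional as F

def refine(model, loader, max_epochs=200, tol=1e-4, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)  # lr from the rebuttal
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=max_epochs)
    best, stall = float('inf'), 0
    for epoch in range(max_epochs):
        total = 0.0
        for feats, pseudo_labels in loader:              # instance features + pseudo labels
            opt.zero_grad()
            _, logits = model(feats)
            loss = F.cross_entropy(logits, pseudo_labels)
            loss.backward()
            opt.step()
            total += loss.item()
        sched.step()                                     # cosine decay per epoch
        # "convergence": loss decrease below a small threshold for 10 consecutive epochs
        stall = stall + 1 if best - total < tol else 0
        best = min(best, total)
        if stall >= patience:
            break
    return model
```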

Reviewer3: (Q10) For permutation invariance. (A) Both training and inference operate on each instance independently, without using its position in the slide. After obtaining the score of each instance, the bag score is computed by mean pooling, which is also permutation invariant.
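
Mean pooling over per-instance scores is indeed permutation invariant; a one-line check:

```python
import torch

scores = torch.rand(500)                # per-instance positive scores of one bag
perm = torch.randperm(len(scores))
assert torch.allclose(scores.mean(), scores[perm].mean())  # bag score unchanged by shuffling
```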




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a weakly-supervised learning method for WSI-level and patch-level classification using CAMELYON16 dataset. The method consists of self-supervised feature extraction, clustering and pseudo-label based feature space refinement. Evaluation shows improved performance over existing methods. During rebuttal, an extra 20x TCGA Lung Cancer Dataset was added and some other updates were reported. These changes should be included in the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a feature distribution based MIL for WSI classification. It utilizes strategies like clustering and refinement. The research idea is reasonable and interesting, and extensive experiments demonstrate the effectiveness. The rebuttal has addressed most of the reviewers’ concerns. The reviewers have commented with several valuable suggestions, and it is suggested that the authors revise the paper accordingly for the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents an interesting multi-instance learning method for WSI classification. In the first round of review, the reviewers unanimously provided positive reviews of this paper. I think the rebuttal has addressed the concerns about the rigor of the methodology and the issues in the experimental setting. In my opinion, this method is interesting for the community to discuss during the MICCAI conference, since the issue that the paper is trying to address is a long-standing challenge in this field. For these reasons, the recommendation is toward acceptance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


