
Authors

Qin Ren, Yu Zhao, Bing He, Bingzhe Wu, Sijie Mai, Fan Xu, Yueshan Huang, Yonghong He, Junzhou Huang, Jianhua Yao

Abstract

Digital pathology plays a pivotal role in the diagnosis and interpretation of diseases and has drawn increasing attention in modern healthcare. Due to the gigapixel size and diverse nature of whole-slide images (WSIs), analyzing them through multiple instance learning (MIL) has become a widely used scheme, which, however, faces the challenges that come with the weakly supervised nature of MIL. Conventional MIL methods mostly utilize either instance-level or bag-level supervision to learn informative representations from WSIs for downstream tasks. In this work, we propose a novel MIL method for pathological image analysis with integrated instance-level and bag-level supervision (termed IIB-MIL). More importantly, to overcome the weakly supervised nature of MIL, we design a label-disambiguation-based instance-level supervision for MIL using Prototypes and a Confidence Bank to reduce the impact of noisy labels. Extensive experiments demonstrate that IIB-MIL outperforms state-of-the-art approaches on both benchmark datasets and a challenging practical clinical task. The code is available at https://github.com/TencentAILabHealthcare/IIB-MIL.
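
As a companion to the abstract, the following is a minimal PyTorch-style sketch of the integrated bag- and instance-level supervision idea. It is not the released implementation (see the repository linked below for the actual code); the attention pooling, layer sizes, and the loss weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IIBMILSketch(nn.Module):
    """Two-branch MIL head (illustrative): the bag branch aggregates patch
    embeddings for the slide-level label; the instance branch is an auxiliary
    training task trained against soft, gradually disambiguated pseudo labels
    and can be dropped at inference time."""

    def __init__(self, dim=512, num_classes=2):
        super().__init__()
        self.attn = nn.Linear(dim, 1)               # simple attention pooling (assumed)
        self.bag_head = nn.Linear(dim, num_classes)
        self.inst_head = nn.Linear(dim, num_classes)

    def forward(self, feats):                       # feats: (num_patches, dim)
        w = torch.softmax(self.attn(feats), dim=0)  # (num_patches, 1)
        bag_emb = (w * feats).sum(dim=0)            # attention-weighted pooling
        return self.bag_head(bag_emb), self.inst_head(feats)

def iib_loss(bag_logits, inst_logits, bag_label, inst_pseudo, alpha=0.5):
    """Bag-level cross-entropy plus instance-level cross-entropy against
    soft pseudo labels; alpha balances the two terms (hypothetical weight)."""
    bag_loss = F.cross_entropy(bag_logits.unsqueeze(0), bag_label.view(1))
    inst_loss = -(inst_pseudo * F.log_softmax(inst_logits, -1)).sum(-1).mean()
    return bag_loss + alpha * inst_loss

# Usage: one bag of 500 patch features, binary slide label,
# uninformative initial pseudo labels.
model = IIBMILSketch()
bag_logits, inst_logits = model(torch.randn(500, 512))
loss = iib_loss(bag_logits, inst_logits, torch.tensor(1),
                torch.full((500, 2), 0.5))
```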

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_54

SharedIt: https://rdcu.be/dnwJ9

Link to the code repository

https://github.com/TencentAILabHealthcare/IIB-MIL

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a multi-task learning framework that trains MIL with bag-level and instance-level supervision simultaneously. For instance-level pseudo-label generation, the authors further propose a Label Disambiguation Module to reduce the impact of noisy pseudo labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work is well motivated, leveraging the weak instance-level supervision for better MIL performance.
    2. The paper is well-written and easy to follow. The logic is sound and most descriptions of the method are clear.
    3. The novelty is acceptable. By using instance-level supervision as an auxiliary task, the bag-level MIL branch enjoys an extra performance boost at no additional inference cost, since the auxiliary task is removed at inference time.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Compared with previous MIL methods, this work actually uses a fixed EfficientNet-B0 together with a learnable Transformer as the patch embedder, which is much more computationally heavy.
    2. Some important details are missing:
       2.1 How is the instance-level classifier initialized? It seems that the initialization of the instance classifier would play an important role in the convergence of the entire framework.
       2.2 When compared with other methods, do those methods also use EfficientNet+Transformer as the patch embedder? If not, the performance gain of IIB-MIL may come mainly from the complicated patch embedder rather than from the instance-level supervision.
       2.3 A comparison with state-of-the-art methods such as DTFD-MIL [ref 1] is missing.
       2.4 Fig. 2(a) presents the visualization of IIB-MIL patch embeddings, but the corresponding visualization for the naïve method is missing, so readers cannot see how IIB-MIL improves the patch representations.

    Ref 1. Zhang, Hongrun, et al. “Dtfd-mil: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper uses public datasets, and the authors say they will release the code upon acceptance. Therefore, the overall reproducibility is good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper would be more convincing if the authors added the aforementioned missing points, including the initialization method of the instance-level classifier, implementation details of the other state-of-the-art methods, a comparison with DTFD-MIL, and a visual comparison between the patch embeddings from naïve methods and IIB-MIL.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has several strengths. The logic is sound, and the methodology seems feasible to me. The experimental results also demonstrate the effectiveness of IIB-MIL. However, there are also weaknesses that cannot be ignored. The major concern is whether the performance gain of the proposed method comes from the over-complicated patch embedder; the improvement would be less meaningful if it is mainly due to the increased computational cost. Additionally, several important details are missing, such as how the instance classifier is initialized. If the authors can address these weaknesses, I believe the paper is qualified for the conference.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper addresses potentially suboptimal bag-level training and noisy instance-level labels in MIL with limited training sample sizes. The authors propose an instance-level noisy-label calibration module and design a multi-task, dual-channel MIL model operating at both the bag and instance levels. On top of the initial instance embeddings, a Transformer-based encoding calibration module is added to achieve more cost-effective instance-level encoding calibration.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper targets a specific problem and designs a reasonable new MIL architecture under a sensible motivation. Experiments are conducted on multiple histopathological image classification tasks; the results validate the proposed method with good performance, and the impact of different hyperparameter settings appears controllable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper lacks a discussion of several related works on similar problems published at recent MICCAI conferences. Although the technical details are not identical, this is clearly not the first work to focus on this problem, which raises questions about the paper's innovation and insights.

    The performance improvement on the experimental tasks is limited, likely because validation on more challenging tasks is lacking.

    Since the proposed method adds a Transformer-based encoding calibration module after the instance embedding, and this module is trained jointly with MIL, it is natural to question whether the memory consumption of the proposed method remains acceptable for extremely high-resolution WSIs.

    The writing of the paper is not perfect, making it difficult for readers to quickly grasp the outline of the method’s main steps before carefully reading the detailed steps.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper does not provide source code or state that source code will be provided. Reproducing the proposed method would be relatively difficult without open-source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please follow these questions:

    1. There is a lack of discussion of related literature (Predicting Molecular Traits from Tissue Morphology Through Self-interactive Multi-instance Learning, MICCAI 2022; Pay Attention with Focus: A Novel Learning Scheme for Classification of Whole Slide Images, MICCAI 2021; etc.), and an explanation for this omission is needed.

    2. From the supplementary material, it can be seen that the proposed method adds a Transformer-based encoding calibration module after the instance embedding, and the number of layers in this module is not small. Can a discussion of memory consumption during model training be provided? Is model training limited on GPUs with smaller memory?

    3. For similar reasons, the proposed method sets the patch size to 1120 and the MIL batch size to only 4. Is this to offset the high memory consumption? These parameters are unusual. When replicating the comparison methods, were the same parameters used? If so, is this unfair to the comparison methods? They might achieve better results with their original, conventional parameters.

    4. What is the original dimension of the instance embedding? What is the dimension after calibration? Figure 2 is interesting. The patch features seem to be separated after the backbone, but I am very interested in whether they are separated before the backbone.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the relatively complete technical contribution of this paper, I expect the authors to provide a high-quality rebuttal, and I will reserve the option to change my rating.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors answered my questions and some of my concerns were addressed, so I am willing to revise my rating slightly. However, other reviewers seem to have found more nuanced flaws. Moreover, my questions about patch size and batch size remain unresolved: the authors apparently need unconventional parameter settings to make instance-improves-bag work, so its practicality remains in doubt. Although these issues make the paper debatable, it can still be brought up for discussion at the conference. In the rebuttal, the authors provide an insightful discussion of the methods in the related literature, which should be added to the camera-ready version if possible.



Review #4

  • Please describe the contribution of the paper

    This paper presents IIB-MIL, a novel approach to analyzing whole-slide images through integrated instance-level and bag-level supervision. The proposed method utilizes a label disambiguation module to establish more precise instance-level supervision and combines it with bag-level supervision to enhance performance. Experimental results demonstrate that IIB-MIL outperforms current state-of-the-art techniques on publicly available datasets and holds significant potential for addressing more complex clinical applications, such as predicting gene mutations. Additionally, IIB-MIL can identify highly relevant patches, providing pathologists with valuable insights into underlying mechanisms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This article has a clear train of thought and is written very well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    However, it also has some obvious shortcomings, which are listed below:

    1. The related and comparative works are not comprehensive. There have been many WSI classification works that combine bag-level and instance-level losses on top of MIL, and the latest SOTA methods include [1-4]. The authors should elaborate on the differences between their method and these methods, supplement the article accordingly, and add performance comparisons.

    [1] Qu L, Wang M, Song Z. Bi-directional weakly supervised knowledge distillation for whole slide image classification[J]. Advances in Neural Information Processing Systems, 2022, 35: 15368-15381.

    [2] Shi X, Xing F, Xie Y, et al. Loss-based attention for deep multiple instance learning[C]//Proceedings of the AAAI conference on artificial intelligence. 2020, 34(04): 5742-5749.

    [3] Myronenko A, Xu Z, Yang D, et al. Accounting for dependencies in deep learning based multiple instance learning for whole slide imaging[C]//Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, 2021: 329-338.

    [4] Qu L, Luo X, Liu S, et al. Dgmil: Distribution guided multiple instance learning for whole slide image classification[C]//Medical Image Computing and Computer Assisted Intervention–MICCAI 2022, 2022: 24-34.

    2. Instance-level performance is not compared. How can the accuracy of the generated instance labels be guaranteed? What is the accuracy of the instance classifier?

    3. The advantages of the prototype-based label disambiguation strategy are not clearly demonstrated. The authors should further strengthen the explanation of the advantages of the prototype-based instance loss.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors promise to open-source the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this article has a clear train of thought, but it lacks a summary of, and comparison with, other SOTA methods in the field that combine bag-level and instance-level losses. The advantages and explanations of the proposed module are insufficient, and some experimental results need to be supplemented and strengthened. However, the idea is interesting, so in the first round of review I give a weak accept score. The authors need to carefully address all of my concerns in the rebuttal, and I will revise my score accordingly afterwards.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors have provided a detailed explanation for my first question but did not give a specific comparison with the SOTA methods mentioned. I also did not understand the authors' feedback on my second point, as their answer was vague and lacked specific comparisons; the second and third questions thus still lack satisfactory answers. By “instance-level evaluation,” I meant assessing the accuracy of the generated instance labels against real instance labels, rather than simply measuring performance by removing the bag-level loss in ablation experiments. Additionally, the advantages of the prototype-based label disambiguation strategy have still not been clearly demonstrated.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper introduces IIB-MIL, a novel approach that tackles the challenges of MIL training and instance-level label noise in the analysis of whole-slide images. The proposed method leverages integrated instance-level and bag-level supervision and incorporates a Label Disambiguation Module to mitigate the impact of noisy pseudo labels. By designing a multi-task dual-channel MIL model and incorporating a Transformer-based encoding calibration module, IIB-MIL achieves more efficient instance-level encoding calibration. Experimental results demonstrate the superiority of IIB-MIL over current state-of-the-art techniques on publicly available datasets.

    Given the conflicting opinions among the reviewers, it is crucial to ascertain whether the concerns raised by Reviewer 3 are adequately addressed in the rebuttal. The reviewers criticize the lack of discussion of related works, questioning the innovation and insights of the proposed method. They also note the limited performance improvement, potentially high memory consumption, and lack of clarity in the paper's writing, which makes it difficult for readers to comprehend the main steps of the method. Furthermore, the reviewers emphasize the need for a more comprehensive discussion of related and comparative works in the field of WSI classification, including performance comparisons with other methods.




Author Feedback

We appreciate the reviewers' recognition of IIB-MIL's novelty and address their concerns below.

1 Related works and method innovation [R3, R4] We compare against the previous works mentioned by all reviewers and clarify our novelty. DTFD-MIL [CVPR2022] proposed a double-tier framework to address the small-sample issue, but it suffers from noisy labels for the pseudo-sub-bags, which our IIB-MIL handles with its label-disambiguation method. FocAtt-MIL [MICCAI2021] improved conventional AB-MIL with a focal-attention mechanism and is a bag-level-supervision-only approach. DGMIL [MICCAI2022] employs cluster-conditioned feature distribution modelling and pseudo-label-based iterative feature refinement to separate instances. However, it requires strict adherence to the MIL assumption, and the iterative K-means clustering and distance calculations over the enormous set of training WSI tiles are computationally expensive; its simple mean-pooling aggregation may also hinder performance on complicated tasks. Implicitly or explicitly, Inter-MIL [MICCAI2022], Pyramid Transformer MIL [MICCAI2022], and WENO [NeurIPS2022] utilize both bag-level and instance-level supervision. These methods can be summarised as bag-improves-instance: they assign attention scores obtained from bag-level supervision as pseudo labels for instances. However, recent works (DTFD-MIL, Loss-Attention [AAAI2020]) have argued that the attention score is not a rigorous metric for this purpose, since accurate attention scores are hard to obtain from limited WSI-label training pairs; Loss-Attention therefore tried to improve the attention mechanism by connecting the attention calculation to an additional loss function. In contrast, IIB-MIL introduces noisy-label learning to address the weakly supervised nature of MIL. We design a new label-disambiguation-based instance-level supervision using prototypes and a confidence bank to reduce the impact of inaccurate labels, and then integrate bag-level and instance-level supervision to combine their advantages. IIB-MIL can thus be described as an instance-improves-bag method that does not use attention scores as pseudo labels. The label-disambiguation strategy gradually updates the pseudo labels so that positive-label, negative-label, and label-irrelevant instances converge to approximately 1, 0, and intermediate values, respectively, rather than remaining fixed at 1 or 0 (a schematic code sketch of this update follows point 6 below).

2 Performance comparisons [R2, R3, R4] We compared IIB-MIL with the six SOTA methods mentioned above. IIB-MIL obtained comparable or superior performance, exceeding the second-best method by at least 1.38%, 0.74%, and 0.52% on the three tasks, respectively.

3 Performance improvement [R3] The prediction of gene mutations is a challenging task with which even skilled pathologists struggle [Inter-MIL [MICCAI2022]]. Nevertheless, IIB-MIL achieved a significant performance boost of over 1.78% (P<0.05) compared with SOTA methods. Furthermore, we observed a significant performance increase on both the TCGA NSCLC and RCC datasets (P<0.05). All compared methods were implemented with their reported best parameters.

4 Computational consumption [R2, R3] IIB-MIL (2.8M) has a smaller model size than the SOTA bag- and instance-level co-supervision methods WENO (20.7M) and Inter-MIL (11.0M), and is comparable to SETMIL (9.9M), TransMIL (2.7M), DGMIL (1.6M), DSMIL (1.8M), ABMIL (1.5M), and DTFD-MIL (0.8M). After testing, we found that a variant of IIB-MIL with only one transformer block in the backbone (0.5M) performed similarly, indicating that the advantages mainly come from the supervision design.

5 Initialization and instance-level performance [R2, R4] The initialization strategy is similar to PiCO [ICLR2022] and is robust with respect to convergence. The performance of instance-level supervision alone has already been reported in the ablation study (Table 2, w/o Bag).

6 Writing [R3] We revised the method section to make it easier to grasp the outline. As stated in the abstract, our code will be provided.
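
To make point 1 concrete, below is a minimal sketch of one prototype-based disambiguation step in the spirit of PiCO [ICLR2022]. The tensor shapes, the EMA coefficients `gamma` and `lam`, and the update order are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def update_pseudo_labels(feats, pseudo, prototypes, conf_bank, idx,
                         gamma=0.99, lam=0.9):
    """One PiCO-style label-disambiguation step (illustrative).

    feats:      (N, D) L2-normalized instance embeddings
    pseudo:     (N, C) current soft pseudo labels for these instances
    prototypes: (C, D) per-class prototype embeddings (EMA-updated buffers)
    conf_bank:  (M, C) confidence bank over all training instances
    idx:        (N,)   global indices of these instances in the bank
    """
    # 1. Score each instance against the class prototypes.
    sims = feats @ prototypes.t()                      # (N, C)
    assign = sims.argmax(dim=1)                        # (N,) hard assignments
    target = F.one_hot(assign, num_classes=pseudo.size(1)).float()

    # 2. Smoothly move pseudo labels toward the prototype assignment, so
    #    positive/negative instances drift to ~1/~0 while label-irrelevant
    #    ones stay intermediate, instead of flipping hard to 1 or 0.
    pseudo = lam * pseudo + (1.0 - lam) * target

    # 3. Keep a running record in the confidence bank for stability
    #    across epochs.
    conf_bank[idx] = gamma * conf_bank[idx] + (1.0 - gamma) * pseudo

    # 4. EMA-update each prototype with the instances assigned to its class
    #    (assumes plain tensors maintained outside autograd).
    for c in range(prototypes.size(0)):
        mask = assign == c
        if mask.any():
            proto_new = F.normalize(feats[mask].mean(dim=0), dim=0)
            prototypes[c] = F.normalize(
                gamma * prototypes[c] + (1.0 - gamma) * proto_new, dim=0)

    return pseudo, prototypes, conf_bank

# Usage: 64 instances, 2 classes, a bank covering 10,000 training instances.
feats = F.normalize(torch.randn(64, 128), dim=1)
pseudo = torch.full((64, 2), 0.5)                      # uninformative start
protos = F.normalize(torch.randn(2, 128), dim=1)
bank = torch.full((10000, 2), 0.5)
pseudo, protos, bank = update_pseudo_labels(feats, pseudo, protos, bank,
                                            torch.arange(64))
```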




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed technique introduces an effective approach for generating instance-level pseudo labels and utilizing them in MIL (multiple instance learning) training. From the paper and the reviewer discussions, there are no major flaws, and the method has been well compared with state-of-the-art techniques. However, R4 requested an instance-level evaluation, and I agree that the response in that regard is lacking. It is necessary to include additional experimental results on how well classification performs at the patch level, using data with tumor segmentation labels.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a multi-task learning framework that utilizes both slide-level and instance-level supervision for WSI analysis and targets label disambiguation, which is very reasonable and relevant in WSI analysis. Before the rebuttal, this paper received two positive reviews and one negative review; the negative review mainly concerned the lack of discussion of related works, leaving the position of this work unclear. The authors provided a good rebuttal to these concerns, and that reviewer changed the score from weak reject to weak accept. Another reviewer changed the original weak accept to weak reject due to the lack of comparisons and experiments in the paper and rebuttal. However, instance-level evaluation is expensive and difficult.

    In its current version, this work should provide useful information to the community and lead to informative discussion. I vote for accepting this paper.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors' responses during the rebuttal process are not convincing enough. We hope that the constructive remarks will help you improve the work for a future submission.


