
Authors

Mengkang Lu, Tianyi Wang, Yong Xia

Abstract

Breast cancer (BC) is one of the most common cancers diagnosed globally among women and has become a leading cause of cancer death. Multi-modal pathological images carry complementary information for BC diagnosis: hematoxylin and eosin (H&E) staining images reveal a considerable amount of microscopic anatomy, while immunohistochemical (IHC) staining images enable the evaluation of the expression of various biomarkers, such as human epidermal growth factor receptor 2 (HER2). In this paper, we propose a novel multi-modal pre-training model via pathological images for BC diagnosis. The proposed pre-training model contains three modules: (1) the modal-fusion encoder, (2) the mixed attention, and (3) the modal-specific decoders. The pre-trained model could be performed on multiple relevant tasks (IHC Reconstruction and IHC classification). Experiments on two datasets (the HEROHE Challenge and the BCI Challenge) show state-of-the-art results.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_44

SharedIt: https://rdcu.be/dnwJZ

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

The authors proposed a novel multi-modal pre-training framework via masked autoencoders for breast cancer diagnosis. The pre-training model includes the modal-fusion encoder, the mixed attention, and the modal-specific decoders. Experiments on two public datasets evaluate its performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

– Innovation. The paper proposed innovative pre-training based on H&E-stained WSIs and IHC-stained WSIs.
– Performance improvement. The paper achieved some improvement on two public challenge datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

– Lacks technical originality. The paper simply applies the MAE model to pre-train on WSIs in H&E and IHC staining and then uses a cross-attention mechanism to fuse the two modalities, which is more of a simple validation exercise lacking originality.
– The comparison results are not very attractive. On the BCI Challenge, the proposed model achieves a higher PSNR by 1.60 and SSIM by 0.007, which is not very impressive. The improvement on the HEROHE Challenge is even weaker.
– Insufficient workload. The results of the other methods on the two public challenges were copied from their leaderboards, which means the authors completed only two sets of experiments.
– Lacks comparison with important related work. There is other similar work predicting HER2 status from H&E WSIs, such as [1], which should be compared against.

[1] Lu W, Toss M, Rakha E, et al. SlideGraph+: Whole Slide Image Level Graphs to Predict HER2 Status in Breast Cancer. 2021.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The experimental data are from the public challenges. The authors say the code will be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

– Correct some contradictions in the text. In the abstract, the authors state that “The pre-trained model could be performed on multiple relevant tasks (IHC Reconstruction and IHC classification).”, whereas the task is actually HER2 status prediction from H&E WSIs. Moreover, in Section 2.2 (Reconstruction Loss), the paper states that “The input of the mixed attention module is the full set of tokens, which include both the remaining patch tokens and the masked patch tokens.”, yet Fig. 2 illustrates that only the remaining tokens are fed into the mixed attention module.
– Add ablation studies to verify the effectiveness of the modules. In the pre-training phase, the encoder takes two modalities as input; however, only the H&E WSIs are fed into the encoder during the downstream tasks. The authors need to run ablations to verify the effectiveness of the pre-trained encoder.
– Add more comparison experiments with other methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

– Lack of technical innovation.
– Insufficient workload.
– Rough writing.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

The authors proposed a multi-modality framework leveraging two different staining images: the conventional H&E and immunohistochemical (IHC) staining. Most interesting is the proposed mixed attention module, which aims to learn the inter- and intra-modality correlations. The proposed masked autoencoder may also be important in addressing one of the challenges in multi-modality learning: missing modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The masked autoencoder could potentially address the missing-modality challenge and could also be a better approach for keeping the reconstructed images from falling into mode collapse. The mixed attention component offers an opportunity to learn the inter- and intra-modality correlations, which is important in multi-modality fusion.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The authors state that they used a learning rate decay strategy in the HER2 staining image generation task; however, they do not provide the details of that strategy. For example, is it an L1 or L2 weight decay? What is the weight decay rate?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide nearly complete information to reproduce the experiments. The only tiny missing detail is the learning rate decay strategy for the HER2 staining image generation task.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The authors proposed a multi-modality framework leveraging two different staining images: the conventional H&E and immunohistochemical (IHC) staining. Most interesting is the proposed mixed attention module, which aims to learn the inter- and intra-modality correlations. The proposed masked autoencoder may also be important in addressing one of the challenges in multi-modality learning: missing modalities. The details of the training and experimental design are clear. The figures in the paper help to illustrate the architecture/pipeline, and the pseudo-code helps in understanding the algorithm.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The innovation of the proposed architecture, the detailed information provided, and the clear, nicely created figures.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

This paper presents a multi-modal pathological (H&E and IHC) pre-training model based on a masked autoencoder. The pre-trained encoder is then used for two downstream tasks (HER2 staining image generation and HER2 status prediction from H&E).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The manuscript is well written and easy to follow. It is encouraging to see how different publicly available datasets were used to perform this study. The results achieved on the HEROHE contest data look promising, but the teams that participated in the contest were not aware of the test dataset distribution.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Below are some minor/major concerns regarding this manuscript:

    • The overall concept of using masked autoencoders for digital pathology data has been presented multiple times before (https://arxiv.org/pdf/2209.01534.pdf, https://arxiv.org/pdf/2205.09048.pdf). It would make more sense to compare against some of the previously proposed approaches in this direction.
    • The technical contribution is very minimal, so it is important to assess/validate this pre-trained model on multiple downstream tasks.
    • An ablation study is missing, so it is difficult to assess the performance of the different components of the proposed approach.
    • The literature review is not comprehensive, which makes it difficult to understand the contributions compared with what has already been done in this regard.
    • The intuition behind mixed attention and the combination of Q_x -> K_y and vice versa is not clear.
    • Overall, the inter-modal attention resembles cross-attention; it is good practice to provide the necessary citations for existing approaches.
    • The supplementary material does not add any value, nor does it have any description linking it to the manuscript.
    • Minor suggestion: another potentially useful publicly available dataset to consider for HER2 classification is https://pubmed.ncbi.nlm.nih.gov/28771788/.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper does not contain any information about releasing the code. Some details about the method implementation have been omitted, and more information will be needed for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

As described in the main weaknesses section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Already described in the main strengths of the paper

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This paper investigates a multi-modal pre-training framework via masked autoencoders for breast cancer diagnosis from multi-modal pathological images (H&E and IHC). The reviewers and AC acknowledge the interesting and innovative problem setting, the reasonable framework design, and the good performance. Although the technique (multi-modal masked autoencoder) is not original, this framework can inspire more research in the medical imaging domain. The reviewers have provided many valuable comments to improve this paper. Please address them in the final version.

    1. Clarify the technical details (R2 and R3).
    2. Discuss the differences and relationships with other related works (R1 and R2).




Author Feedback

  1. Motivation (R1, R2) Our method, Multi-Modal Pathological Masked AutoEncoders (MMP-MAE), is based on Masked AutoEncoders (MAE) for multi-modal pre-training. We focus on a practical clinical issue: completing missing modalities for better breast cancer diagnosis. Multi-modal pre-training has achieved great success in the computer vision area and could also bring tremendous improvements to multi-modal pathological image analysis. To the best of our knowledge, this is the first pre-training work based on multi-modal pathological data.
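
[Editor's note] For readers unfamiliar with MAE, the sketch below illustrates the random patch masking that this style of pre-training builds on. It is a minimal sketch under stated assumptions: the random_masking helper and the 75% mask ratio are illustrative, and the per-modality masking details of MMP-MAE may differ.

    import torch

    def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
        """MAE-style random masking: keep a random subset of patch tokens.

        tokens: (batch, num_patches, dim). Hypothetical helper illustrating
        the masking step that MAE-based pre-training builds on.
        """
        b, n, d = tokens.shape
        num_keep = int(n * (1 - mask_ratio))
        # Random permutation of patch indices, independently per sample.
        noise = torch.rand(b, n)
        ids_shuffle = torch.argsort(noise, dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        # Gather the visible (unmasked) tokens that the encoder will see.
        visible = torch.gather(
            tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
        # Binary mask marking which patches were removed (1 = masked);
        # the decoder is trained to reconstruct these masked patches.
        mask = torch.ones(b, n)
        mask.scatter_(1, ids_keep, 0)
        return visible, mask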

Two related works based on MAE (https://arxiv.org/pdf/2209.01534.pdf, https://arxiv.org/pdf/2205.09048.pdf) are both for single modality (H&E) pre-training. In the latter work, the authors adopt a separation approach to extract H channel and E channel images from H&E images. SlideGraph+ (https://arxiv.org/pdf/2110.06042.pdf) utilizes an in-house dataset, and we will compare this method on HEROHE dataset in the next version of our paper.

  2. Clarify the technical details (R2, R3) In the multi-head self-attention (MHSA) and multi-head cross-attention (MHCA) modules, the vectors Q, K, and V represent patch vectors. Within the MHSA module, Q_x, K_x, and V_x are used to establish correlations among patches within the same modality. In the MHCA module, Q_x, K_y, and V_y are used to establish correlations among patches across different modalities. In our paper, Q_x, K_y, and V_y enable us to leverage the complementary information between H&E and HER2 images through the MHCA module. (R2)
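
[Editor's note] To make the Q/K/V roles above concrete, here is a minimal PyTorch sketch of a mixed attention block combining MHSA and MHCA. It is illustrative only: the MixedAttention class, the additive fusion of the two attention outputs, and all hyperparameters are assumptions, not the authors' actual implementation.

    import torch
    import torch.nn as nn

    class MixedAttention(nn.Module):
        """Sketch of mixed attention: intra-modal MHSA plus inter-modal MHCA."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            # MHSA: Q, K, and V all come from the same modality (Q_x, K_x, V_x).
            self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # MHCA: queries from modality x, keys/values from modality y
            # (Q_x, K_y, V_y).
            self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, tokens_x: torch.Tensor, tokens_y: torch.Tensor):
            # Intra-modal correlations among patches of modality x.
            intra, _ = self.mhsa(tokens_x, tokens_x, tokens_x)
            # Inter-modal correlations: Q from x attends to K and V from y.
            inter, _ = self.mhca(tokens_x, tokens_y, tokens_y)
            # One plausible fusion: residual sum of both outputs, then normalize.
            return self.norm(tokens_x + intra + inter)

    # Usage with H&E tokens as modality x and HER2 (IHC) tokens as modality y.
    he_tokens = torch.randn(2, 196, 768)    # (batch, patches, embedding dim)
    her2_tokens = torch.randn(2, 196, 768)
    fused = MixedAttention(dim=768)(he_tokens, her2_tokens)
    print(fused.shape)  # torch.Size([2, 196, 768])

The symmetric direction (Q_y attending to K_x, V_x) would be a second instance of the same block with the modalities swapped.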

The learning rate strategy in the HER2 generation task is to reduce the learning rate to 1% of its value after every 50 epochs. This strategy aids in stabilizing the training process and potentially allows the model to converge to a better solution over time. (R3)
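
[Editor's note] A minimal sketch of this schedule, assuming "reduce to 1%" means multiplying the current learning rate by a factor of 0.01 every 50 epochs (PyTorch's StepLR with gamma=0.01); the model and optimizer below are placeholders.

    import torch

    model = torch.nn.Linear(768, 768)  # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # Multiply the learning rate by 0.01 (reduce it to 1%) every 50 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=50, gamma=0.01)

    for epoch in range(150):
        # ... training batches for the HER2 generation task would run here ...
        optimizer.step()   # stand-in for the per-batch parameter updates
        scheduler.step()   # advance the schedule once per epoch
        if (epoch + 1) % 50 == 0:
            print(epoch + 1, scheduler.get_last_lr())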

  3. Limited improvement on HER2 generation tasks (R1) In the paper introducing the BCI dataset, the authors note that “it is still very challenging to establish an accurate mapping from HE to HER2 expression on our dataset”. The paired data in the BCI dataset are not aligned very well. In our pre-training on the ACROBAT dataset, however, we made significant progress by reducing the error distance to under 200 μm and aligning the paired images effectively. The disparity between the two datasets could explain the limited improvement observed on the BCI dataset. To further validate the effectiveness of our method, we plan to evaluate it on additional datasets, such as the ANHIR dataset and the dataset mentioned by R2, to assess its generalizability. Additionally, we will divide the ACROBAT dataset into training, validation, and test sets to further verify the effectiveness of our method.

  4. Mistakes, related references, and complete experiments We sincerely thank the reviewers for identifying the mistakes in our paper, and we are fully committed to addressing them diligently. We will implement the corrections they suggested and ensure that the references they mentioned are appropriately included in our paper. As we are limited by the page count, we will expand our experiments in the next version of the paper. We consider this feedback invaluable for improving the quality and credibility of our work, and we are deeply grateful for these contributions.


