
Authors

Mohammad Minhazul Haq, Junzhou Huang

Abstract

The accurate segmentation of nuclei is crucial for cancer diagnosis and further clinical treatment. For semantic segmentation of nuclei, Vision Transformers (VT) have the potential to outperform Convolutional Neural Network (CNN) based models due to their ability to model long-range dependencies (i.e., global context). Usually, VT and CNN models are pre-trained on a large-scale natural image dataset (i.e., ImageNet) in a fully-supervised manner. However, pre-training nuclei segmentation models with ImageNet is of limited help because of the morphological and textural differences between the natural image domain and the medical image domain. Moreover, ImageNet-like large-scale annotated histology datasets rarely exist in the medical image domain. In this paper, we propose a novel region-level Self-Supervised Learning (SSL) approach and a corresponding triplet loss for pre-training a semantic nuclei segmentation model with unannotated histology images extracted from Whole Slide Images (WSI). Our proposed region-level SSL is based on the observation that non-background (i.e., nuclei) patches of an input image are difficult to predict from surrounding neighbor patches, and vice versa. We empirically demonstrate the superiority of our proposed SSL-incorporated VT model on two public nuclei segmentation datasets.
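As a rough illustration of the region-level triplet loss mentioned in the abstract, the sketch below applies a standard triplet margin loss to region-level feature vectors. This is an assumption-laden reconstruction for illustration only, not the paper's exact formulation: the function name, feature shapes, distance metric, and margin are all hypothetical.

```python
import numpy as np

def region_triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on region-level feature vectors.

    anchor, positive, negative: 1-D feature vectors. The positive is assumed
    to come from the same set (foreground or background) as the anchor, the
    negative from the other set. Pulls anchor toward positive and pushes it
    away from negative until the gap exceeds the margin.
    """
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)
```

When the negative is already farther from the anchor than the positive by at least the margin, the loss is zero; otherwise it grows linearly with the violation.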

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_30

SharedIt: https://rdcu.be/cVRrN

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

The paper addresses the problem of nuclei segmentation in whole slide images (WSI) and how vision transformers (VT) can be used for this. VT are data-hungry and are therefore usually pre-trained on ImageNet, but this is not very useful for nuclei segmentation. The paper presents a pre-training strategy which makes VT perform better on WSI.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem is relevant and well motivated. The approach is sound. Results are promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

In places, the presentation is unclear, especially Section 3.1, which is a pity, as this is the main contribution of the paper. For example, you define a matrix Hard, whose elements indicate how hard it is to predict a non-background patch. This seems to be used only element-wise to define sets of feature vectors (2); why not define (2) directly from prediction difficulty? Also, the argument for calling these sets foreground and background patches is unclear; is this only an assumption, or…? To me it seems we can only say that the sets contain feature vectors of patches that are difficult/easy to predict. The mathematical notation is difficult to follow due to the many variables and the use of italic (math) font for multi-letter variables; for example, $An_{i,j}$ would normally read as $A$ multiplied by $n_{i,j}$. Hard is also not a good name for a matrix, but if you insist, typeset it in roman. (On a similar note, subscripts which are not variables should be set as $L_\mathrm{region}$, and functions like max and softmax should also be typeset in roman.) In Figure 2, the text is very small. In Figure 3, the images are very small (and in the printed version, the blue and yellow arrows provide poor contrast against the black-and-white images). The conclusion is bland; it is much better to make conclusions which are refutable, for example something like: "Our results show that VT may be pre-trained to …". A tiny thing: in Section 1 you say k=32, but later it seems you divide into 16x16 patches. What is the explanation?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The level of detail is high, so it should be possible to reproduce the work. It seems the authors will not provide the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please address the comments listed under weaknesses of the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The work is relevant and sound, and the results are promising. The presentation may be improved.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

The authors utilize self-supervised learning on unlabelled data to tackle the need for large quantities of data in transformer pre-training for medical applications. The paper proposes a unique framework to pre-train by combining a triplet loss and a scale loss with a learnable background/foreground criterion.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of using SSL for transformer pre-training is feasible for a wide range of medical application.

    • The idea of quantifying patch prediction difficulty to distinguish between nuclei and background is novel. The proposed region-level triplet learning, combined with the scale loss, is well designed and shows satisfactory improvement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The comparison seems to lack a setting with TransUNet (the segmentation baseline) pre-trained on MoNuSegWSI.

    • The evaluation of the method does not include an ablation study. Although the performance comparison table shows that pre-training on MoNuSeg and fine-tuning with supervision enhance performance, the paper does not provide a compelling analysis of how much the triplet loss contributes to the improvements, or whether the pre-training on MoNuSegWSI is responsible for the performance boost.

    • The idea could be useful; however, the layout and writing can still be further improved, and some more revisions might be helpful and necessary. Overall, the submitted version looks like an incomplete version completed in a hurry, especially the experiments section.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The complete architecture is shown in the manuscript, but some implementation details remain hidden or unclear. The authors could consider open-sourcing their project, which might help improve its reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The setup of the ablation studies in this work is pale and weak. I do not find the current version convincing in justifying the contribution of each individual design. A more comprehensive ablative setup is suggested, together with a more insightful ablation discussion.
    2. It might be preferable to the community to denote the Vision Transformer as ViT.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see my comments above.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for preparing the rebuttal. After checking the responses, I would like to keep my decision unchanged.



Review #3

  • Please describe the contribution of the paper

    This paper introduces a new self-supervised learning method to pre-train Vision Transformers for nuclei segmentation. The proposed method drives the network to predict patch features from the surrounding neighboring patches, thereby encouraging the network to learn meaningful nuclei features. The method is motivated by the observation that non-background patches are more difficult to predict than background ones. After fine-tuning the pre-trained weights in the nuclei segmentation network, the proposed method is shown in practice to achieve better performance than previous works.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Overall the proposed method is novel. The pretext task of predicting the patch features is interesting. The employment of region-level triplet loss is reasonable.
    • Compared to previous approaches, the proposed method achieves better performance on two public datasets. This shows the benefits of the proposed method.
    • This paper is well written, with clear organization and motivation. The method is easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The triplet loss and scale loss are not new. They are from previous works. But I think this is not a big issue.
    • In the paper, it is unclear why the proposed self-supervised training method does not fall into trivial solutions, e.g., the network simply outputting a feature map with constant values. I suspect this could be due to the scale loss. Some explanation would be necessary in the paper.
    • In the experiment, no ablation studies are provided to show the impact of scale loss and triplet loss. It is unclear if the triplet loss only contributes marginally to the overall performance. In other words, it is possible that the good performance of TransNuSS is mainly due to the scale loss.
    • The proposed method utilizes Vision Transformer, which can be more effective than ResNet in many cases. In the experiment, it is demonstrated that the proposed method outperforms InstSSL, which employs ResNet. Therefore, such a superior performance could be due to the Vision Transformer. To make a fair comparison, it is suggested to compare with InstSSL (and other methods) which also employs Vision Transformer.
    • This paper claims that applying the proposed pre-training technique to a Vision Transformer is a contribution. However, I think this contribution is trivial because Vision Transformer can also be replaced with convolutional networks here.
    • In the experiment, the paper does not compare with other general self-supervised methods which can also be applied to medical image processing, such as MoCo, BYOL, SimCLR, etc. It is suggested to also compare some of these methods to show the advantages of the proposed method.

    References: [1] Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning. NeurIPS 2020. [2] A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020. [3] Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper provides some implementation details about the proposed method. It is unclear if the authors will provide official code, so there could potentially be some issues in reproducing the method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    (1) The paper could add some explanations for illustrating why the proposed method will not fall into trivial solutions. (2) Additional ablation studies regarding the losses are necessary. (3) To make a fair comparison, it is better to also adopt Vision Transformer in InstSSL and some other methods. (4) It would be better to also compare with other generic self-supervised learning methods, such as MoCo, BYOL, SimCLR, etc. (5) It seems like the font of the paper does not follow the MICCAI paper template. It is suggested to fix this.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall I think the idea of predicting the patch features from the neighboring patches is interesting. This kind of idea has also been adopted by other concurrent papers, but from a different perspective. From the experimental results, it can be observed that the proposed method has some advantages over previous approaches. But it is unclear if such advantages still hold after employing Vision Transformers in the baseline methods. Also, this paper lacks important ablation studies; also it lacks some comparisons with other generic self-supervised learning methods.

    To summarize, the major factors that led me to the overall score are: novelty and experimental validation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    After reading the rebuttal, I think the authors have addressed most of my concerns, particularly those on ablation studies. Therefore I would like to increase my rating for this paper.

    My remaining concerns for this paper include:

    1. Some of the descriptions in the paper should be improved, an issue also raised by Reviewer 1.
    2. The contribution of triplet loss is not significant, as can be seen from the experimental results in the rebuttal.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers commented on the novelty of the proposed approach, which includes self-supervised learning for vision transformer pre-training, and combining a triplet loss and a scale loss with a learnable background/foreground criterion for the pre-training. However, they also raised concerns about missing details and a missing ablation study and comparisons.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We would like to thank the reviewers and ACs for their constructive comments and acknowledging the novelty of our work. Based on the reviewers’ comments, we will improve the clarity with ablation studies and further details in paper revision.

  1. Ablation studies regarding triplet loss and scale loss (Meta-R#2, R#2, R#3): To understand the impact of each loss (i.e., the region-level triplet loss and the scale loss), we first pre-train TransNuSS on the MoNuSegWSI dataset using a single loss (either the triplet loss or the scale loss), and then fine-tune the pre-trained model. For experiment-1, the quantitative comparison of TransNuSS w/o triplet loss, TransNuSS w/o scale loss, and TransNuSS (i.e., with both losses) is IoU%: 66.28 vs. 66.72 vs. 67.02 and Dice score: 0.7951 vs. 0.8007 vs. 0.8059, respectively. Similarly, for experiment-2, the comparison is IoU%: 66.83 vs. 67.66 vs. 68.72 and Dice score: 0.8147 vs. 0.8236 vs. 0.8307, respectively. These comparisons show that the proposed TransNuSS outperforms both TransNuSS w/o triplet loss and TransNuSS w/o scale loss; the best performance comes when both losses are applied together. Among the two losses, the scale loss prevents TransNuSS from falling into trivial solutions (e.g., the network simply outputting a feature map with constant values) during pre-training. In summary, the two losses complement each other and together yield the strong performance of TransNuSS.

One might assume that pre-training on MoNuSegWSI is mainly responsible for the good performance of TransNuSS. However, this assumption is incorrect, since the two Self-Supervised Learning (SSL) baselines AttnSSL and InstSSL were also pre-trained on the same MoNuSegWSI dataset yet achieved much inferior performance compared with the proposed TransNuSS. Essentially, the proposed region-level triplet loss and scale loss together help TransNuSS outperform both SSL baselines.

  2. Employing Vision Transformer (ViT) in baseline SSL methods (R#3): We employed TransUNet in the InstSSL model (i.e., replacing the ResUNet backbone with TransUNet), pre-trained it on MoNuSegWSI, and then fine-tuned it. We denote the TransUNet-based InstSSL as InstSSL-ViT. For experiment-1, the quantitative comparison of InstSSL-ViT and TransNuSS is IoU%: 66.32 vs. 67.02 and Dice score: 0.7991 vs. 0.8059, respectively. Similarly, for experiment-2, the comparison is IoU%: 68.11 vs. 68.72 and Dice score: 0.8256 vs. 0.8307, respectively. Therefore, in both experiments, the proposed TransNuSS model outperforms InstSSL even when a ViT is employed in InstSSL. This again demonstrates the effectiveness of the proposed region-level triplet loss applied together with the scale loss.

  3. Comparison with generic self-supervised methods (R#3): In our experiments, we chose the baseline SSL methods AttnSSL and InstSSL over generic SSL models because AttnSSL and InstSSL were explicitly devised for the nuclei segmentation problem, and these two SSL methods perform significantly better for nuclei segmentation than generic self-supervised methods.

  4. Clarification of foreground and background sets (R#1): In our work, we stated that the sets FG and BG in equation-(2) contain foreground and background features, because it is observed, and believed, that predicting foreground patches is harder than predicting background patches (see Fig. 1). The supplementary material also validates this intuition.
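The rebuttal argues that the scale loss keeps pre-training from collapsing into trivial solutions, i.e., a constant feature map. The paper's actual scale loss is not specified in this page, so the sketch below is purely a hypothetical variance-style penalty illustrating how such a regularizer can block a collapsed output; the function name, threshold of 1.0, and per-dimension formulation are all assumptions for illustration.

```python
import numpy as np

def variance_penalty(features, target_std=1.0):
    """Hypothetical anti-collapse regularizer (not the paper's scale loss).

    features: (n_patches, d) array of patch feature vectors. Penalizes any
    feature dimension whose standard deviation across patches falls below
    target_std; a constant (trivial) feature map has std 0 everywhere and
    receives the maximum penalty.
    """
    std = features.std(axis=0)                       # per-dimension spread
    return np.maximum(0.0, target_std - std).mean()  # hinge on low spread
```

A network that outputs identical features for every patch incurs the full penalty, while sufficiently spread-out features incur none, which is the general mechanism by which variance-style terms prevent trivial solutions.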




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper initially received one weak accept and two weak rejects. The authors submitted a strong rebuttal, providing additional evidence from ablation studies and comparisons with other methods. The authors also clarified why they chose the baselines presented in the paper. Due to the strong rebuttal, Reviewer 3 changed their rating from weak reject to weak accept. Considering that the paper has novel contributions, provided a strong rebuttal, and the final ratings are two weak accepts and one weak reject, I lean towards accepting the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on the observation that different patches differ in prediction difficulty, the paper proposes a new self-supervised learning method for pre-training Vision Transformers for nucleus segmentation. The improvements are promising. In the rebuttal, the authors nicely addressed the reviewers' questions, mainly the ablation study and comparisons. As a result, one reviewer changed their rating from weak reject to weak accept. I think this paper can be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a pre-training method for vision transformers specifically targeting nuclei segmentation in whole slide images. The major weakness was validation (missing ablation studies regarding the loss functions), which was well addressed in the rebuttal with new results. One reviewer increased their rating from 4 to 5, so the overall rating leans towards acceptance. I agree to accept this paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


