
Authors

Ziniu Qian, Kailu Li, Maode Lai, Eric I-Chao Chang, Bingzheng Wei, Yubo Fan, Yan Xu

Abstract

Histopathological image segmentation algorithms play a critical role in computer-aided diagnosis technology. The development of weakly supervised segmentation algorithms alleviates the time-consuming and labor-intensive burden of medical image annotation. As a subset of weakly supervised learning, Multiple Instance Learning (MIL) has been proven to be effective for segmentation. However, MIL lacks information about the relationships between instances, which limits further improvement of segmentation performance. In this paper, we propose a novel weakly supervised method for pixel-level segmentation in histopathology images, which introduces the Transformer into the MIL framework to capture global or long-range dependencies. The multi-head self-attention in the Transformer establishes relationships between instances, addressing the shortcoming that instances are treated as independent in MIL. In addition, deep supervision is introduced to overcome the limited annotations in weakly supervised methods and to make better use of hierarchical information. State-of-the-art results on the colon cancer dataset demonstrate the superiority of the proposed method over other weakly supervised methods. We believe our approach has potential for a variety of applications in medical imaging.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_16

SharedIt: https://rdcu.be/cVRrn

Link to the code repository

https://github.com/Nexuslkl/Swin_MIL

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a transformer-based multiple instance learning (MIL) method for weakly supervised histopathology image segmentation. The motivation behind this paper is modeling the dependencies between MIL instances via the multi-head self-attention of the transformer. In addition, the authors propose deep supervision to overcome the limitation of annotations in weakly supervised scenarios and to make better use of the hierarchical information from the Swin Transformer. The experimental results demonstrate the method’s superiority over other weakly supervised methods for the segmentation task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    -Overall, the paper is very well written and presents an effective combination of current methods to achieve a successful weakly supervised segmentation method on histopathology images.

    -State-of-the-art segmentation results are achieved on the colon cancer dataset.

    -The authors carry out enough ablation studies on the components of their method, proving the efficacy and contribution of each of these parts.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -The paper does not present a significant novel concept but rather combines existing approaches in an efficient manner for the given task. It builds upon the Swin Transformer and deep supervision for the side outputs. The latter has already been used in CNN-based segmentation approaches. Addressing the correlation of MIL instances was the theme of a few recent papers, e.g., the TransMIL method. However, since those works mostly address the classification task rather than segmentation, I still believe the minor contribution is valuable.

    -Since there is no separate validation set, I wonder how the authors determine optimal values for the hyperparameters, e.g., the parameter r of the generalized mean function (the standard form is recalled below for context) and the weights of the three side-output layers. It is also unclear how much of the performance improvement is due to hyper-parameter tuning.
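
    For context, the generalized mean (GM) pooling that the parameter r presumably refers to is the standard MIL aggregation of per-instance probabilities into an image-level prediction; this is a recall of the textbook form, not an excerpt from the paper:

    ```latex
    % Generalized mean (GM) pooling over the N instance probabilities \hat{y}_i of a bag.
    % r = 1 recovers the plain average; r -> \infty approaches max-pooling.
    \hat{Y} = \left( \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i^{\,r} \right)^{1/r}
    ```

    Larger r pushes the aggregation toward max-pooling, so r controls how strongly the most confident instances dominate the bag-level prediction, which is why its value matters as a hyperparameter.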

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Probably. The authors mention that all code and models will be made available upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I suggest adding a discussion of the chosen hyperparameters, the weaknesses of the approach, and possible failure cases.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the novelty is limited, the proposed model is motivated by the reasonable idea of modeling correlations between instances in the MIL approach for the segmentation task, and the experimental results show the effectiveness of the proposed method. Considering the possibility of application to a wide range of medical image segmentation tasks, I lean toward accepting this paper.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a Transformer-based method for semantic segmentation of histopathology images. The Swin Transformer is introduced to this specific task to account for related information between instances in MIL. The method was evaluated on a public colon cancer dataset in comparison with a number of MIL methods and reached SOTA results. In addition, the ablation study explored the effect of backbones, stages, and deep supervision. The idea seems promising and valuable for research in this field; however, the paper needs further improvement.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper introduces the Swin Transformer to MIL and the semantic segmentation of pathology images.
    2. Extensive comparisons and ablation studies prove the effectiveness of the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The introduction of the Swin Transformer to pathological image segmentation is novel, as it considers the relations between instances in MIL. However, the paper lacks further explanation or visualization of how the Swin Transformer models these relations and benefits segmentation.
    2. Deep supervision is an existing method from prior work.
    3. Most parts of the method are described in detail. However, some details are still unclear and need to be further clarified.
    4. The method is compared with several MIL methods, but only one fully supervised method, U-Net. Also, the experiments are conducted on a single dataset with two metrics. It would be better to provide experiments on more datasets, such as Camelyon16, with more metrics such as the Dice coefficient.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper gives a detailed explanation of its method, experiments on a public dataset, and plans to release the code; thus the reproducibility is good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. It would be better if the authors further explained how the Swin Transformer considers relations between instances. For example, in the Method section, the statement “Similar features get high attention weights while dissimilar ones get low attention weights, which leads to an improvement in distinguishing foreground and background” needs further explanation.
    2. The image size and patch size need further clarification. I wonder whether the H&E images refer to the original size (3000) or the downsampled size (256). Also, is it necessary to downsample the images from 3000 to 256, since only 4×4 patches are fed into the Swin Transformer?
    3. Some details of the method need further explanation or correction. I wonder how the multi-scale features are fused through the fuse layer. In Fig. 1, the “Swin Transformer Block” is actually two successive Swin Transformer blocks. Also, the Decoder is mixed with the “structure of decoder”, so the clarity could be improved.
    4. The paper uses F1 and HD as metrics, but there are other popular metrics for semantic segmentation, such as the Dice coefficient, and it would be better if these were also adopted.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is novel in introducing the Swin Transformer with deep supervision to pathology image segmentation. The comparisons and ablation studies prove the effectiveness of the method.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    In this paper, a transformer-based MIL framework is proposed to overcome the limitation of segmentation performance caused by the lack of correlation between instances. In addition, deep supervision is introduced to strengthen the constraints. The experimental results show that the weakly supervised segmentation method proposed in this paper is effective.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. From the perspective of the characteristics of the semantic segmentation task, this paper proposes to improve segmentation performance by overcoming the problem of independent instances in MIL.
    2. The weakly supervised baselines in the experiments are relatively sufficient, which proves the effectiveness of the Swin-Transformer-based MIL.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The use of deep supervision to better exploit image-level annotations and strengthen constraints is already prevalent in other works and cannot be described as a contribution of this paper.
    2. The interpretability of the fusion strategy is insufficient, and the exact fusion of the side-outputs is unknown, so its definition is not clear. In other words, the features of the side-outputs of the different layers differ, which means that different fusion strategies bring different effects.
    3. There is an ambiguity between the visualization results and the quantitative results in this paper. The visualization results of U-Net are very close to those obtained by Swin-MIL, and it even seems that the results of the Swin-Transformer-based MIL outperform those of U-Net, which is inconsistent with the comparison of their F1-scores and Hausdorff Distances.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper appears to be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1. On page 3, the authors define Y_n \in {0,1}, while in the last line of page 4 the definition appears to confuse Y_n with \hat{Y}_n.
    2. The authors do not specify whether the quantity is defined as the probability of a pixel in the image being positive or negative.
    3. The full name of CAM is not given when it first appears in the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Weakly supervised semantic segmentation is interesting. This paper is recommended for acceptance due to the contributions and motivation claimed in the manuscript.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper proposes a transformer-based multiple instance learning method for pathology semantic segmentation. Although the methodology is built by combining several existing strategies, including the Swin Transformer, deep supervision, and multiple instance learning, the experimental results demonstrate better performance. The reviewers also affirmed the merits of this paper. The issues raised by the reviewers include unclear hyperparameter optimization, the need to further explain how the Swin Transformer considers the relations between instances, and some other implementation details. Please address these concerns in the final version.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

Hyperparameters: Learnable weights cause over-segmentation problems and degrade performance. With only the limited supervision of image-level annotations, MIL methods with learnable weights tend to perform better at classification but not at segmentation. Thus, we adopt a fixed-weight strategy to preserve multi-scale information. The fixed weight values we use are the optimal settings determined by experiments.

Further explanation of how the Transformer builds relationships, and the choice of the Swin Transformer: The Transformer employs the self-attention mechanism through multi-head self-attention. Self-attention computes the response at a position in a sequence by attending to all positions and taking their weighted average in an embedding space. That is, self-attention aggregates contextual information from the other instances in a bag in MIL. By weighting the values with an attention matrix, self-attention increases the difference between classes, i.e., the distance between foreground and background in semantic segmentation. Therefore, the feature maps from the Transformer implicitly encode the relationships between instances in MIL. Among Transformer methods, the Swin Transformer constructs multi-scale feature maps and has demonstrated its effectiveness and efficiency on segmentation tasks. Moreover, the Swin Transformer provides higher-resolution feature maps than other Transformer methods, which facilitates prediction maps with a lower upsampling ratio. Thus, we adopt the Swin Transformer in our method. We will add a visualization of the feature maps to demonstrate the effectiveness of the Transformer. Compared with the feature maps output by a CNN model, the feature maps of the Transformer reflect the role of self-attention in establishing the relations between instances.
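
As an illustration of the mechanism described above (a minimal sketch, not the authors' implementation), single-head self-attention over the instance embeddings of one MIL bag can be written as follows; all tensor names and dimensions here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(instances, d_k=64, rng=np.random.default_rng(0)):
    """Single-head self-attention over one MIL bag.

    instances: (N, d) array, one embedding per instance (e.g. a patch token).
    Returns an (N, d_k) array in which every output row is a weighted average
    of all instances, so each instance aggregates context from the whole bag.
    """
    n, d = instances.shape
    # Random projections stand in for the learned W_q, W_k, W_v matrices.
    w_q, w_k, w_v = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    q, k, v = instances @ w_q, instances @ w_k, instances @ w_v

    # Attention weights: similar instances receive high weights, dissimilar ones low.
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # (N, N)
    return attn @ v                                   # (N, d_k)

# Toy usage: a bag of 16 instance embeddings of dimension 96.
bag = np.random.default_rng(1).standard_normal((16, 96))
out = self_attention(bag)
print(out.shape)  # (16, 64)
```

Because every output row is a mixture of all instances weighted by pairwise similarity, the representation of each instance depends on the rest of the bag, which is the instance relationship the rebuttal refers to.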

Other details:
1. Dice coefficient: In the semantic segmentation task, the Dice coefficient is equivalent to the F1-score (positive). We will add an explanation when the metrics first appear.
2. Image size and patch size: The original resolution of each image is 3000×3000, and each image is resized to 256×256 for training in all experiments due to memory limitations. In the Swin Transformer, each image is split into patches of size 4×4 for learning.
3. Fusion layer: The multi-scale feature maps are upsampled to the same size as the raw images and fused by a weighted sum.
4. Weakness of the approach: Obtaining high-quality boundaries of cancerous regions with only image-level supervision still needs improvement, which is the focus of our future work.
5. Failure cases: We will add some failure cases.
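
As a rough sketch of item 3 (the fusion layer), the following shows how multi-scale side-output probability maps could be upsampled to the input resolution and combined by a fixed weighted sum; the function name, stage resolutions, and weight values are placeholders, not the settings used in the paper:

```python
import torch
import torch.nn.functional as F

def fuse_side_outputs(side_outputs, weights, out_size=(256, 256)):
    """Upsample each side-output map to the input resolution and fuse them
    with a fixed weighted sum.

    side_outputs: list of (B, 1, h_i, w_i) tensors from different stages.
    weights:      list of floats, one fixed (non-learned) weight per side output.
    """
    assert len(side_outputs) == len(weights)
    fused = 0.0
    for prob_map, w in zip(side_outputs, weights):
        up = F.interpolate(prob_map, size=out_size, mode="bilinear",
                           align_corners=False)
        fused = fused + w * up
    return fused  # (B, 1, 256, 256) fused prediction map

# Toy usage with three hypothetical stage resolutions and placeholder weights.
maps = [torch.rand(2, 1, s, s) for s in (64, 32, 16)]
pred = fuse_side_outputs(maps, weights=[0.5, 0.3, 0.2])
print(pred.shape)  # torch.Size([2, 1, 256, 256])
```

Keeping the weights fixed rather than learned corresponds to the fixed-weight strategy described in the hyperparameter response above.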

Thanks for the reviewers’ valuable suggestions and we will address these concerns in the final version.


