
Authors

Shaoteng Zhang, Jianpeng Zhang, Yutong Xie, Yong Xia

Abstract

Most existing weakly-supervised segmentation methods rely on class activation maps (CAM) to generate pseudo-labels for training segmentation models. However, CAM has been criticized for highlighting only the most discriminative parts of the object, leading to poor-quality pseudo-labels. Although some recent methods have attempted to extend CAM to cover more areas, the fundamental problem remains unsolved. We believe this problem stems from the huge gap between image-level labels and pixel-level predictions, and that additional information must be introduced to bridge it. Thus, we propose a text-prompting-based weakly supervised segmentation method (TPRO), which uses text to introduce additional information. TPRO employs a vision encoder and a label encoder to generate a similarity map for each image, which serves as our localization map. Pathological knowledge is gathered from the internet and embedded as knowledge features, which guide the image features through a knowledge attention module. Additionally, we employ a deep supervision strategy to fully utilize the network’s shallow information. Our approach outperforms other weakly supervised segmentation methods on the benchmark datasets LUAD-HistoSeg and BCSS-WSSS, setting a new state of the art. Code is available at: https://github.com/zhangst431/TPRO.
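
As a rough illustration of the similarity-map idea described in the abstract, the sketch below computes per-class localization maps as the cosine similarity between patch features from a vision encoder and per-class label embeddings from a text encoder. This is a minimal sketch, not the authors’ implementation; all names, shapes, and the normalization choice are assumptions.

    import torch
    import torch.nn.functional as F

    # Minimal sketch of the similarity-map localization idea (illustrative
    # only; names and shapes are assumptions, not the TPRO code).
    def label_similarity_maps(patch_feats, label_embeds):
        # patch_feats:  (B, N, D) patch features from a vision encoder
        # label_embeds: (C, D)    one text embedding per class label
        # returns:      (B, C, N) per-class localization maps
        patch_feats = F.normalize(patch_feats, dim=-1)    # cosine similarity
        label_embeds = F.normalize(label_embeds, dim=-1)
        return torch.einsum("bnd,cd->bcn", patch_feats, label_embeds)

    B, N, D, C = 2, 196, 512, 4  # e.g. 14x14 patches, 4 tissue classes
    maps = label_similarity_maps(torch.randn(B, N, D), torch.randn(C, D))
    print(maps.shape)  # torch.Size([2, 4, 196])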

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43907-0_11

SharedIt: https://rdcu.be/dnwb9

Link to the code repository

https://github.com/zhangst431/TPRO

Link to the dataset(s)

https://drive.google.com/drive/folders/1E3Yei3Or3xJXukHIybZAgochxfn6FJpr

https://drive.google.com/drive/folders/1iS2Z0DsbACqGp7m6VDJbAcgzeXNEFr77


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes a framework, termed TPRO, to incorporate language information into the weakly supervised semantic segmentation task. Related knowledge collected from the internet is also incorporated into this pipeline. Compared with previous works, the proposed method obtains the best performance on two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • It leverages text information from MedCLIP, trained on a large dataset, to incorporate language and help boost performance.
    • Pathological knowledge is also gathered from the internet to increase accuracy.
    • The paper is well-organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    For the knowledge attention, it would be better to have an ablation study that isolates the impact of simply increasing the number of attention modules. A simple way to do that is to feed only the image features to the knowledge attention, and then fuse the output with the text embeddings (a possible sketch follows below).
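
    One hypothetical form of this ablation is sketched below: the attention module sees only image features, and the text embeddings are fused afterwards, so any gain from text cannot be confused with the gain from simply adding another attention block. The module, the fusion scheme, and all names/shapes are illustrative, not part of TPRO.

        import torch
        import torch.nn as nn

        # Hypothetical ablation sketch: image-only self-attention followed by
        # late fusion with text embeddings. All names/shapes are illustrative.
        class ImageOnlyAttentionAblation(nn.Module):
            def __init__(self, dim=512, heads=8):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.fuse = nn.Linear(2 * dim, dim)

            def forward(self, img_feats, text_embeds):
                # img_feats: (B, N, D) image tokens; text_embeds: (B, D)
                x, _ = self.attn(img_feats, img_feats, img_feats)  # image-only
                t = text_embeds.unsqueeze(1).expand(-1, x.size(1), -1)
                return self.fuse(torch.cat([x, t], dim=-1))        # late fusion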

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper has provided most of the details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    There are some papers working on weakly supervised semantic segmentation with CLIP or transformer architectures [1, 2]. It would be better to discuss them.

    [1] Extract Free Dense Labels from CLIP. ECCV 2022.
    [2] Multi-class Token Transformer for Weakly Supervised Semantic Segmentation. CVPR 2022.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is the first work to leverage text to improve weakly supervised histopathology image segmentation. I think it is helpful for the community.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes TPRO, a novel approach that uses text to provide additional information alongside images to generate pseudo-labels via CAM for training weakly supervised segmentation models on histopathology data. The proposed method outperforms other techniques on two benchmark datasets for weakly supervised segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes a useful and scalable technique involving multimodal data to improve the performance of WSS methods: scrape the internet for textual information about the labels and provide it to the model for additional guidance regarding segmentation. This idea is novel in histopathology, but in line with recent works focusing on the integration of data from text and images.

    2. The paper is easy to follow and provides a detailed breakdown of the model architecture. The method produces SOTA results on two datasets, and the gains from adding the text component are specifically shown in the ablation studies.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Is there a reason for using two different encoders for the knowledge text and the label? Why can’t we use either CLIP or ClinicalBERT for both of them?
    2. How sensitive is the model to the freezing of the text encoders, as well as to the presence of the adaptive layer? Do we see something similar to contrastive learning, where the presence of a downstream projector layer significantly improves the embedding quality for that task?
    3. In general, could the architecture be simplified or made more performant by having a single module that applies (a) self-attention on the knowledge encoder output and (b) cross-attention between the image features at different depths and the knowledge features (instead of concat -> self-attention)? See the sketch after this list.
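
    A sketch of the simplification proposed in point 3, purely for illustration (not the TPRO implementation; all names and shapes are assumptions):

        import torch.nn as nn

        # (a) self-attention over the knowledge tokens, then (b) cross-attention
        # from image tokens (queries) to knowledge tokens (keys/values),
        # replacing the concat -> self-attention design. Illustrative only.
        class KnowledgeCrossAttention(nn.Module):
            def __init__(self, dim=512, heads=8):
                super().__init__()
                self.know_self = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, img_feats, know_feats):
                # img_feats:  (B, N, D) image tokens at some depth
                # know_feats: (B, K, D) encoded knowledge sentences
                k, _ = self.know_self(know_feats, know_feats, know_feats)  # (a)
                out, _ = self.cross(img_feats, k, k)                       # (b)
                return out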
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have shared details about the training algorithm, the hyperparameters, and the hardware used. They could also share the different libraries and versions used in the experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. There are some minor typos in Figure 2 (“Pilex-label correlation” should read “Pixel-label correlation”, and the losses are indexed from L_2 instead of L_1).
    2. Also, some details about the impact of the weights of the different losses on the final performance could be added.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper proposes a novel approach to combine text and image information for WSS problems, and the results are impressive. This is consistent with current trends in computer vision, and this work will motivate the community to think about integrating different modalities of data to improve over image-only techniques. Recommended for acceptance.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work focuses on weak supervision for the segmentation of histopathological data. It uses a combination of vision, text, and class encoders to derive pseudo segmentation labels that can then be used to train a segmentation network in a supervised setting. The main contribution is the use of a knowledge encoder to add extra information to the outputs of the vision encoder and the label encoder.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • a novel investigation into the use of ‘text-based knowledge’ from ClinicalBERT outputs for the extraction of additional features for pseudo-label generation
    • very well written and easily navigable
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • completely misses details on the ‘vision encoder’; it is a black box.
    • not entirely clear how the ‘knowledge input’ is aggregated for each example during inference (does it require an internet search?) and what it may consist of. NB: this is the main contribution of the paper.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Most details required for reproducibility are explained, except for the Vision Encoder (a total black box). The supplementary material presents additional hyperparameters in Table 1. Will the code be provided?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Please explain why no details of the Vision Encoder are provided. At least some basic details are necessary in this case for reproducibility.
    • It is not entirely clear how the ‘knowledge input’ is aggregated for each example during inference (does it require an internet search?) and what it may consist of. Since this is the main contribution of the paper, could you please explain, with examples, what kind of input ClinicalBERT receives for each sample?

    In Section 2.2, “we gather text representations of different subtype manifestations from the Internet and encode them into external knowledge via the knowledge encoder.” What are these manifestations? It is unclear what kind of knowledge is used here and how it would help guide the image features. E.g., “tumor epithelial tissue is”: what kind of information is provided here? Also, how do you obtain the label input in the first place?

    Why do you train for only 8 epochs? Does this mean your Vision Encoder is already pre-trained?

    Tables 1-4: what are the units of the reported numbers?

    Explain the abbreviations of the tissue regions in the tables (with a reference to the supplementary material if necessary), e.g., TE, NEC, etc.

    In Equation 3 you mention three loss functions, L1, L2, and L3. However, Figure 2 shows L4, L3, and L2. Is there an error?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is exceptionally well written in terms of details, especially for reproducibility’s sake (except for the details of the Vision Encoder), and is generally very readable. However, I am unsure of the technical contribution beyond the addition of a new input from ClinicalBERT (without any substantial changes, it seems).

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes a framework, termed TPRO, to incorporate language information into the weakly supervised semantic segmentation task. Related knowledge collected from the internet is also incorporated into this pipeline. Compared with previous works, the proposed method obtains the best performance on two datasets. The three reviewers affirmed the merits of this paper. The remaining issues include the shortage of ablation studies, the missing details on the ‘vision encoder’, the explanation of how the ‘knowledge input’ is aggregated, and some other implementation details mentioned by the reviewers. Please address these concerns in the final version.




Author Feedback

We appreciate the positive feedback and suggestions from both the AC and the reviewers on our work, and we will take their suggestions into account to further improve the quality of this work, either in the final version or in future work.


