Authors

Wentao Liu, Chaofan Ma, Yuhuan Yang, Weidi Xie, Ya Zhang

Abstract

The goal of this paper is to interactively refine the automatic segmentation on challenging structures that fall behind human performance, either due to the scarcity of available annotations or the difficult nature of the problem itself, for example, on segmenting cancer or small organs. Specifically, we propose a novel Transformer-based architecture for Interactive Segmentation (TIS), that treats the refinement task as a procedure for grouping pixels with similar features to those clicks given by the end users. Our proposed architecture is composed of Transformer Decoder variants, which naturally fulfill feature comparison with the attention mechanisms. In contrast to existing approaches, our proposed TIS is not limited to binary segmentations, and allows the user to edit masks for an arbitrary number of categories. To validate the proposed approach, we conduct extensive experiments on three challenging datasets and demonstrate superior performance over the existing state-of-the-art methods.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_67

SharedIt: https://rdcu.be/cVRwU

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper describes a method to incorporate user interaction into the segmentation process. The method is validated using datasets containing lungs, colons, and pancreas. An ablation study is also provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The results of the proposed method are better than those of five previous methods. The results improve quickly with more user interaction. The ablation study also shows that the new components (click encoding and label assignment) both contribute to the improvement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of the method is not specific enough and is sometimes confusing; see “detailed & constructive comments” below.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Some important details are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    There are several points which should be more clear and consistent.

    (1) The descriptions of click encoding and label assignment are not specific enough:

    The authors state “$\phi_{index}$ refers to an indexing function that simply extract the features from the dense feature map”, but it is unclear what feature extraction means here.

    Also, in the statement “$\phi_{CE}$ refers to a projection from the category labels to high-dimensional embeddings” it is unclear how the projection between the category label and the high-dimensional embeddings works.

    (2) It is confusing in which steps the user interaction (clicks) is used.

    On the one hand, in Sec. 2.3 it reads “After stacking 6 layers of click encoding and label assignment modules, we adopt a linear layer to read out the segmentation labels, and train it with pixelwise cross-entropy loss.”, so it seems that clicks are involved during training.

    But on the other hand, according to Sec. 2.1 (“allow end users to refine its own output by incorporating feedback during inference”), the clicks are only used during inference, but not during training.

    (3) In Fig. 2, subfigure D1 (Lung Cancer), D3-1 (Pancreas), and D3-2 (Pancreas Cancer), there are several curves which almost do not change with increasing number of clicks. In subfigure D2 (Colon Cancer), the green curve even drops when there are more clicks. This is strange since all compared methods are interactive methods, so their results are supposed to improve if more clicks are used. Is there an explanation for this phenomenon?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall results are good, but important details are missing. Also, there are unexpected observations in the comparison with previous methods.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a framework based on Transformer networks and user clicks to segment arbitrary structures, and they test it on challenging ones, such as organ cancers. It refines automatic segmentations through the addition of click annotations and is called Transformer-based architecture for Interactive Segmentation (TIS). They demonstrate their results on three different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strength is that most segmentation methods must be trained for a specific application in order to reduce error, whereas this method can segment most cases of organ cancer by refining the result through clicks given by the users. This kind of interaction is natural for clinicians, for example, who are used to delineating structures and lesions manually. Usability is therefore essential to this paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    - The paper defines a method and tests it only on challenging datasets. The method was not tried on real applications, or in collaboration with or supported by medical personnel or end users.
    - The assumption of clicks on work in progress is not new, and other studies have used them before with simpler methods.
    - Perhaps the authors could elaborate on why their results are significant compared with previous methods.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper can be reproduced if the authors release the code for the project. The datasets are available for tests. A more detailed explanation of the software modules may be necessary to reproduce it correctly, but this can be done if the code is well documented.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The paper is well organized and the method is explained in enough detail. The next step is to test the system on real data, together with a usability evaluation with end users.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is innovative, but the technical approach would perhaps benefit from a sequence diagram so that each part can be fully understood.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes TIS (Transformer-based Interactive Segmentation), a new network architecture that allows the refinement of segmentation inferences from a (multi-class) segmentation network ($\Phi_{ENC}$, with parameters $\Theta_e$) using clicks (i.e. xy positions and labels) encoded through a click encoding network ($\Phi_{REF}$, with parameters $\Theta_r$). In these experiments the segmenter is a U-Net, but it can in principle be any type of encoder-decoder. The click encoder (the main contribution) is Transformer-based and takes as inputs both the click coordinates and the (vectorized) encoder output. This step is followed by a “label assignment” mechanism whose purpose is to learn to balance the contribution of the clicks with respect to similarity to ground truth labels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written, the proposed architecture is quite innovative (although I am no expert in interactive segmentation), and the results are very convincing, with consistent improvements from both the click encoding mechanism and the label assignment step.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I am missing a technical aspect. I fail to understand why label assignment adds so much over click encoding alone, and why click encoding alone performs so poorly. As I understand it, label assignment includes the click encoding step. However, based on the results in Table 2, label assignment leads to a stunning +8% Dice in cancer segmentation compared to click encoding only, while results using click encoding do not seem to benefit at all from an increased number of clicks (-1% Dice between 5 and 10 clicks). Perhaps this balance between the two steps should have been explained more clearly for non-expert readers.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Handling the UI is always a bit tricky, so open-sourcing the code would be welcome.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Validation is performed by simulating automatic clicks in regions where the encoder failed based on the ground truth, and evaluating the improvement with increasing numbers of clicks. The authors provide a comparison to many other interactive segmentation methods (which I am not familiar with) and demonstrate significant and consistent improvement over SOTA.

    The paper is well written, the proposed architecture is quite innovative (although I am no expert in interactive segmentation), and the results are very convincing, with consistent improvements from both the click encoding mechanism and the label assignment.

    I am missing a technical aspect. I fail to understand why label assignment adds so much over click encoding alone, and why click encoding alone performs so poorly. As I understand it, label assignment includes the click encoding step. However, based on the results in Table 2, label assignment leads to a +8% Dice in cancer segmentation. Results using click encoding do not seem to benefit at all from an increased number of clicks (-1% Dice between 5 and 10 clicks), while impressive results are achieved using

    However, I strongly recommend accepting this very interesting paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a very original contribution in the rather niche field of interactive segmentation. The architecture is very generic and it is therefore likely to have a large range of potential applications in the field of medical image segmentation.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper proposes a Transformer based framework for interactive image segmentation by grouping pixels with similar feature representation to those clicks given by the end users.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes an effective way to utilize the Transformer to transfer users’ annotations into the pixels with similar representation. The approach can also be applied to multi-class segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The details of the model training are not clear. What is the size of the user click during the training step? Is the encoder trained together with the decoder for nnU-Net? What is the final loss function?
    2. It would be nice to have a more detailed caption for Fig. 1. For example, what does “Cat Enc” stand for?
    3. The experiments show that the proposed method only outperforms the baseline methods after a certain number of clicks (more than 5). Why does the proposed method need more clicks than the other methods?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Meets the requirements.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See the weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an interesting method to utilize Transformer to measure the similarity between pixels to assign semantic labels.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work presents a Transformer-based framework that allows interactive refinement of predicted segmentation masks. One of the key contributions of this work is the design of an encoder network that encodes the provided clicks. All the reviewers agree on the quality, relevance and novelty of the work. Despite the positive impression, the reviewers have some remarks which, although minor, would greatly improve the quality of the paper if addressed. I recommend that the authors go through them and address them.

    AC recommendations

    • In the abstract, the authors motivate their work by mentioning the difficulty of segmenting small organs. As there is no evidence that this work may assist in that task, given that the experiments are held on rather larger organs, it is advised to remove any mention of small organs, as it can be misleading.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank all reviewers for their constructive comments. Here we aim to resolve the concerns.

To R1:

Q1: The descriptions of click encoding and label assignment are not specific enough. A1: “$\phi_{index}$” means to index the vectors from dense features corresponding to the user’s clicks. “$\phi_{CE}$” aims to convert the one-hot category label into a high-dimensional vector, which is done by a learnable MLP.

Q2: It is confusing in which steps the user interaction (clicks) is used. A2: The user interaction should be adopted at inference stage. However, to properly train the model, we have to simulate the user interactions at training stage as well, with groundtruth annotations.

Q3: On Colon Cancer, BS-IRS performs worse when there are more clicks. A3: We agree with the reviewer, and indeed find that several interactive approaches may suffer from this issue. This is due to the stochastic nature of users’ input; in other words, performance improves only if the user’s click is informative enough to correct the erroneous predictions. As future work, we will investigate algorithms that automatically suggest the most informative points to users.
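
The answers A1 and A2 to R1 can be illustrated with a minimal sketch. It assumes that a training click is simulated by picking one mislabeled pixel at random and reading its label from the ground truth, that $\phi_{index}$ is a plain lookup into the dense feature map, and that $\phi_{CE}$ is a small two-layer MLP applied to the one-hot label. All function names, layer sizes, and the random weights (standing in for learned ones) are hypothetical and are not taken from the paper.

```python
import random


def simulate_click(pred, gt, rng):
    """Assumed click simulation: choose one mislabeled pixel at random.

    Returns (y, x, true_label), or None if the prediction has no errors.
    """
    errors = [(y, x) for y, row in enumerate(gt)
              for x, label in enumerate(row) if pred[y][x] != label]
    if not errors:
        return None
    y, x = rng.choice(errors)
    return y, x, gt[y][x]


def phi_index(features, clicks):
    """Look up the feature vector stored at each clicked pixel."""
    return [features[y][x] for y, x, _ in clicks]


def one_hot(label, num_classes):
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def linear(vec, weights, bias):
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def init_mlp(num_classes, hidden_dim, embed_dim, rng):
    """Random weights standing in for the learnable MLP parameters."""
    def layer(n_in, n_out):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        return w, [0.0] * n_out
    return layer(num_classes, hidden_dim), layer(hidden_dim, embed_dim)


def phi_ce(label, num_classes, params):
    """Map a one-hot category label to an embedding with a two-layer MLP (ReLU)."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(one_hot(label, num_classes), w1, b1)]
    return linear(hidden, w2, b2)


rng = random.Random(0)
gt = [[0, 0, 1], [0, 1, 1], [2, 2, 1]]      # toy ground truth, 3 categories
pred = [[0, 0, 0], [0, 1, 1], [2, 0, 1]]    # errors at (0, 2) and (2, 1)
# Toy dense feature map: one 4-dimensional vector per pixel.
features = [[[float(y), float(x), float(y * x), 1.0] for x in range(3)]
            for y in range(3)]

click = simulate_click(pred, gt, rng)
clicks = [click]
click_features = phi_index(features, clicks)
params = init_mlp(num_classes=3, hidden_dim=8, embed_dim=4, rng=rng)
label_embeddings = [phi_ce(label, 3, params) for _, _, label in clicks]
```

In this reading, each click contributes two vectors: an image feature gathered at its position and an embedding of its category.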

To R2:

Q1: The effect of label assignment. A1: Label assignment allows the model to directly copy the label information from the users’ annotations. In other words, without it, the model loses the ability to encode category information and is unable to propagate it to the pixel embeddings.
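
One way to picture the "copying" of label information described in the answer to R2 is as a single attention step: pixels act as queries, click features as keys, and click label embeddings as values. The sketch below is an assumption about how such a step could look, not the authors' implementation; `assign_labels` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_labels(pixel_feats, click_feats, click_label_embs):
    """Each pixel attends to every click (scaled dot product) and receives
    the attention-weighted sum of the clicks' label embeddings."""
    scale = math.sqrt(len(click_feats[0]))
    dim = len(click_label_embs[0])
    out = []
    for q in pixel_feats:
        weights = softmax([dot(q, k) / scale for k in click_feats])
        out.append([sum(w * v[d] for w, v in zip(weights, click_label_embs))
                    for d in range(dim)])
    return out


# Two clicks with clearly different features and (for readability) one-hot
# label embeddings; two pixels, each resembling one of the clicks.
click_feats = [[4.0, 0.0], [0.0, 4.0]]
click_label_embs = [[1.0, 0.0], [0.0, 1.0]]
pixel_feats = [[3.9, 0.1], [0.2, 4.1]]

assigned = assign_labels(pixel_feats, click_feats, click_label_embs)
```

Each pixel ends up with almost exactly the label embedding of the click it resembles. If the values carried no category information, the attention weights alone would have nothing to propagate, which matches the explanation given in A1.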

To R3:

Q1: The method is only tested on challenging datasets; users did not try the method on real applications. A1: We agree with the reviewer. It would be of great interest to deploy our system in real applications, and we are currently developing easy-to-use software with a GUI.

Q2: The assumption of clicks on working progress is not new and other studies have used them before with simpler method. A2: Existing approaches usually require the clicks to meet certain conditions, for example, boundary points, extreme points, etc. In our case, however, we do not enforce special requirements, which allows more flexibility for interaction while also demonstrating better performance.

To R4:

Q1: The details of the model training are not clear. A1: Due to space limitations, we were unable to include all the implementation details; we will make all code available. During training, we randomly pick single pixels as user clicks. The nnU-Net model was pre-trained and not updated with the decoder. As for the training loss, we adopt the pixelwise cross-entropy loss, as mentioned in Sect. 2.2.

Q2: Fig. 1 needs a more detailed caption, for example, what does Cat Enc stand for? A2: Thanks for pointing this out; we will improve the caption of Fig. 1. Cat Enc refers to “Category Encoding”.

Q3: The proposed method can only outperform baseline methods after a certain click number (more than 5). A3: Existing baselines require users to click on the structure boundary, which tends to provide more informative interaction but is certainly more difficult and time-consuming from the user’s perspective. In contrast, our proposed method allows more flexible interactions, and 5 clicks falls within the acceptable range for user interaction.
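
For reference, the pixelwise cross-entropy loss mentioned in A1 to R4 has the standard form below. The notation is assumed here rather than taken from the paper: $N$ is the number of pixels, $C$ the number of categories, $y_{i,c}$ the one-hot ground truth, and $\hat{p}_{i,c}$ the softmax probability read out for pixel $i$ and category $c$. Averaging over pixels is a common convention and is not confirmed by the source.

```latex
\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log \hat{p}_{i,c}
```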

To Meta-reviewer:

Q1: It is advised to remove the mention of small organs, as it can be misleading. A1: We mention small organs because the goal of this paper is to refine the automatic segmentation of challenging structures, including small organs. On larger organs, automatic methods already perform well enough.


