Authors

Qin Liu, Zhenlin Xu, Yining Jiao, Marc Niethammer

Abstract

Interactive image segmentation has been widely applied to obtain high-quality voxel-level labels for medical images. The recent success of Transformers on various vision tasks has paved the way for developing Transformer-based interactive image segmentation approaches. However, these approaches remain unexplored and, in particular, have not been developed for 3D medical image segmentation. To fill this research gap, we investigate Transformer-based interactive image segmentation and its application to 3D medical images. This is a nontrivial task due to two main challenges: 1) limited memory for computationally inefficient Transformers and 2) limited labels for 3D medical images. To tackle the first challenge, we propose iSegFormer, a memory-efficient Transformer that combines a Swin Transformer with a lightweight multilayer perceptron (MLP) decoder. To address the second challenge, we pretrain iSegFormer on a large amount of unlabeled data and then finetune it with only a limited number of segmented 2D slices. We further propagate the 2D segmentations obtained by iSegFormer to unsegmented slices in 3D images using a pre-existing segmentation propagation model pretrained on videos. We evaluate iSegFormer on the public OAI-ZIB dataset for interactive knee cartilage segmentation. Evaluation results show that iSegFormer outperforms its convolutional neural network (CNN) counterparts on interactive 2D knee cartilage segmentation, with competitive computational efficiency. When propagating the 2D interactive segmentations of 5 slices to other unprocessed slices within the same 3D volume, we achieve an 82.2% Dice score for 3D knee cartilage segmentation. Code is available at https://github.com/uncbiag/iSegFormer.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_45

SharedIt: https://rdcu.be/cVRyZ

Link to the code repository

https://github.com/uncbiag/iSegFormer

Link to the dataset(s)

https://pubdata.zib.de/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper shows how to use a pretrained Transformer model for interactive image segmentation. It obtains a model that uses a small number of clicks to segment knee cartilage MRI images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Obtains good results in interactive 2D segmentation of knee cartilage MRI images. Obtains acceptable results in cross-domain evaluation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The literature review is not written clearly enough to establish how the proposed iSegFormer model differs from SegFormer [10], which casts doubt on the method’s novelty. The method overclaims interactive 3D segmentation but only offers 2D segmentation that is compared with the state of the art. The 3D segmentation, achieved through segmentation propagation, obtains only middling results that are not compared with any state of the art.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Some details should be added on which specific pretrained Swin Transformer has been used. On page 6, the training set has 1521 images, which cannot be correct since only 407 volumes are supposed to be used, with 3 images from each.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Obtains good results in 2D segmentation but unconvincing results in 3D segmentation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors have addressed some of my concerns.



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel deep learning architecture based on a vision transformer (ViT) to solve interactive MR image segmentation. Interactive image segmentation takes in user input in addition to the image itself: the user may click on the image with a “positive click” to indicate a region that should be segmented as foreground and with a “negative click” to indicate a region that should be segmented as background. While ViTs have been applied to non-interactive image segmentation a number of times in the literature, this is the first use of ViTs in interactive image segmentation as far as I can tell (this is also what the authors claim). The authors compare their model to state-of-the-art CNN-based architectures (known for lighter memory requirements than ViTs) on 2D MRI knee cartilage segmentation. The authors make efforts to design their ViT-based architecture in a memory-friendly way with a Swin Transformer encoder and a lightweight MLP-based decoder.
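
    For readers unfamiliar with the setup, here is a minimal sketch of how such clicks are commonly fed to a segmentation network: the image is stacked with one map per click polarity. The disk encoding and the radius below are illustrative assumptions, not necessarily the paper’s exact settings.

    ```python
    import numpy as np

    # Hypothetical input assembly for a click-based interactive model:
    # the image slice plus one channel per click polarity.
    def build_input(image, pos_clicks, neg_clicks, radius=5):
        """image: (H, W) float array; clicks: lists of (y, x) points.

        Returns a (3, H, W) array: [image, positive map, negative map].
        """
        def click_map(points):
            m = np.zeros_like(image, dtype=np.float32)
            yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
            for y, x in points:
                # Rasterize each click as a filled disk of the given radius.
                m[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1.0
            return m

        return np.stack([image.astype(np.float32),
                         click_map(pos_clicks),
                         click_map(neg_clicks)])
    ```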

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In my opinion, the paper’s greatest strengths are the novelty of the architecture used and the demonstration of what seem to be new state-of-the-art results using said architecture. The architecture uses simple building blocks proposed in other works (e.g., Swin Transformer blocks); however, the conscious effort to design a lightweight, memory-efficient network results in an effective tool for the task at hand. With a memory overhead nearly identical to state-of-the-art CNN-based models, their new architecture demonstrates improved 2D interactive knee cartilage MRI segmentation (see Table 1). The authors also do a good job of demonstrating their due diligence in architecture design. They show that other Transformer-based backbones with similar numbers of parameters have a considerably larger memory overhead and worse speed than their proposed architecture. They also perform an ablation study investigating different pre-training datasets and fine-tuning configurations, comparing to CNN-based models, and show that their architecture achieves the best results overall across all configurations. Overall, their design of experiments seems sound. Finally, the extension demonstrating how their 2D results can be used to accurately segment 3D images using pre-existing propagation techniques wraps up the paper nicely and ties back to their ultimate goal of segmenting 3D images with limited training data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Throughout the paper, a couple of methods are left under-explained, which hurts the reproducibility of the work (see comments regarding reproducibility below). However, this may have more to do with the authors’ efforts to fit their work into the word/page limit. A few methods/details are explained in the supplementary material, but a couple still seem under-explained even with it. In addition, in many of the tested tasks (across domains) the improvements in performance seem very incremental (non-significant), and many of the tasks are computer-vision rather than medical datasets, so they might not be as interesting to the MICCAI community (though that does not take away from the contribution of the work here).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall, the reproducibility seems to be good. The authors use public datasets for training/evaluation of their models (e.g., the OAI-ZIB dataset), and their methods are reasonably well-explained. There are, however, a couple of training-related methods that seem to be under-explained, which could impact the reproducibility of their work, should a researcher try to reproduce their exact methods:

    There are a few questions that are unanswered about the automated click generation procedure during training/inference: (1) How are the clicks initialized for a given sample? (2) What if there are multiple false positive or false negative regions? Is a positive or negative click generated at the center of each distinct region?

    As someone who is unfamiliar with training interactive segmentation models but very familiar with training non-interactive models, how are samples/batches fed into the network? Especially since you’re adding clicks to false positive/false negative regions during training, I’m assuming you feed the same training sample into the network multiple times during the same epoch, with different click configurations. If this is the case, are you consecutively feeding the same training sample (with different click configurations), or do you interleave with other training samples/slices?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    There are a few places where acronyms are missing definitions or should have been defined earlier. For example, on page 2, it would have been nice to define “STCN,” and I believe the acronym for CNN (following paragraph) should have been defined earlier, as CNN was used in the second paragraph of the intro.

    On page 3, you assert that U-Nets are not memory-efficient due to their symmetric encoder-decoder architecture. While I understand what you’re saying, I’m not sure that symmetry between the encoder and decoder is sufficient for a network not to be memory-efficient (in theory, I could imagine such a thing as a “memory-efficient” symmetric architecture). There are works in the literature that support your point (i.e., that networks with a smaller decoder may be more efficient); perhaps you could elaborate on this or cite one of these works (e.g., Paszke et al.’s ENet from 2016).

    In Fig. 1, the arrow going around “interaction loop” seems to be going in the opposite direction of what is indicated by the rest of the flow diagram. Also, I believe there may be a typo in the iSegFormer architecture graphic, but I’m not sure. If this is for segmentation, shouldn’t the output size be H x W x N_cls? (as opposed to H/4 x W/4 x N_cls)

    In the first paragraph of your experiments section, it is unclear why you choose to focus on cartilage segmentation exclusively instead of femur or tibia segmentation. I understand your reasoning for choosing one for a short conference paper, but you may want to justify this decision in a larger paper.

    I wasn’t sure why slices per second was the metric of choice for speed instead of seconds per slice. Slices per second seems to imply that the process of loading new slices is implicated in the measurement of speed, although this seems like it would be independent of network architecture. If what we’re trying to measure is the speed at which the network makes a prediction on a single slice (I could be wrong), seconds per slice seems like it would be a more intuitive metric.

    The cross-domain evaluation experiment was interesting and felt a bit strange at first. I didn’t expect a model trained exclusively on COCO+LVIS data to generalize to medical imaging, let alone three separate tasks. You may want to discuss your reasoning for conducting this experiment in more detail for a larger paper. Would we expect to encounter a scenario where we need to train a model on non-medical imaging data and translate it, without fine-tuning, to the medical imaging domain? On an unrelated note, I appreciated how you still included results demonstrating that CNN-based models generalized better in this experiment than the transformer-based models.

    In the first paragraph of the cross-domain evaluation, you say you trained models (in the previous experiment) using 1,221 slices. This seems to be at odds with your earlier statement that you used “1521 training slices, 150 validation slices, and 150 testing slices.” Perhaps one of these numbers is a mistake?

    In a larger paper, I think your ablation experiment merits more discussion. The HRNet32 model seems to outperform the Swin-based models in several ablation configurations, although the Swin-based models seem to achieve the best results overall (pretrained on ImageNet-21K with fine-tuning).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The weaknesses were relatively few, mainly related to the under-explanation of specific training details. I think this under-explanation was likely the product of trying to fit their work into the page constraint, and I don’t doubt that their results could be reproduced if these method details were elaborated on. Meanwhile, the novelty of their methods/architecture and demonstration of superior results (state-of-the-art using an architecture not used in this style of task) seem to have the potential to move the field forward.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors proposed an interactive segmentation method using a memory-efficient model combining a Swin Transformer with a lightweight multilayer perceptron decoder. They applied their method to interactive 2D medical image segmentation on the public OAI-ZIB dataset for the segmentation of knee cartilage on MRI. The authors claim their method’s performance is superior to its CNN counterparts while achieving comparable computational efficiency. They further extended their Transformer model to 3D interactive knee cartilage segmentation, borrowing techniques from video analysis to propagate 2D slice-based segmentations to the previously unsegmented slices of the 3D volume.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A super-interesting approach, especially the use of video-based methods to extend into 3D (which makes conceptual sense), although this aspect is not the main contribution of the work according to the authors (Section 3.4)
    • Code was made available
    • Public datasets were used
    • The comparison between methods was fair in that the different models were trained on the same datasets and evaluated on the same datasets
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There are no statistics supporting any claims of superior performance. Just because one number appears higher (or lower) than another does not mean that this perceived difference is statistically significant.
    • The manuscript lacks error estimates. Whenever a performance metric is reported, whether this is the number of clicks, Dice, etc., this should be done in the form of a mean value and an error estimate such as a 95% confidence interval or, at a minimum, a standard deviation.
    • I am unsure of the clinical utility of having a method that requires, say, 20 clicks to obtain a satisfactory segmentation (85% IoU is a high bar, though; with a less stringent threshold, fewer clicks would be needed). In my area of research, which involves 2D/3D lesion segmentation, the number of clicks required by a clinician is limited as much as possible to 1 (approximate center), 2 (2D bounding box), or 3 (3D bounding box). Please comment.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code was made available and the datasets used are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Minor comments:

    • Caption of Table 1: say ‘difficult cases’ rather than ‘hard cases’.
    • Table 1 shows performance for 2D slices in terms of the number of clicks required to reach a pre-determined level of IoU. Table 3 shows 3D segmentation performance using different numbers of 2D slices to start propagation into 3D. It is unclear to me, however, which 2D segmentations were used to start propagation: the ones with 85% IoU, the ones with 90% IoU, or even 2D ‘ground truth’ segmentations? How does the location of the 2D starting slices impact the end result of the 3D segmentation?
    • Why use Dice to evaluate the 3D segmentations instead of IoU, which was used earlier for 2D when determining #clicks? I’d like to see IoU (which I think is preferable to Dice, though I know Dice is used extensively) reported along with Dice.
    • Related to the clinical utility comment above under weaknesses, could you comment on “how good is good enough” for this segmentation problem? There is usually substantial inter-observer variability when determining a reference standard for segmentation problems, and it would be good to know how your method compares to the variability in the reference standard itself, e.g., if the average IoU for the ‘ground truth’ of 2 clinicians is only 0.6, then your requirement of 0.8 or more may be making things unnecessarily difficult.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this is an interesting approach but I’d like to see a bit more in the Discussion about envisioned clinical utility.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors state that they will include statistics in a revision but in the meantime do not provide any insight into what those might be in their response. Claims of superiority (or equivalence) always need to be backed by appropriate proof.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This work proposes the use of a pretrained transformer for interactive segmentation. The method is validated in several datasets with a final clinical application of knee segmentation from MR images. All the reviewers have a very positive impression of this work, acknowledging its originality, clarity and thorough experiments. Nevertheless, there are some remarks that have been raised by the reviewers that could still be addressed. These include:

    • Clarify if the method is truly 3D or if the 3D is achieved through propagation of 2D segmentations (R1).
    • Clearly establish the differences with SegFormer[10] (R1).
    • Clarify aspects of the training procedure (see R2’s remarks on reproducibility).
    • Back up claims of significant differences with statistics (see R3).
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank all our reviewers for their insightful comments. We first address the comments summarized by the meta-reviewer.

(1) [R1] Clarify if the method is truly 3D or if the 3D is achieved through the propagation of 2D segmentations. The proposed iSegFormer is a 2D method, but we extend it to 3D images with an existing mask propagation model.

(2) [R1] Clearly establish the differences with SegFormer. iSegFormer is an interactive segmentation model while SegFormer is not; iSegFormer is also much more memory-efficient and faster than SegFormer, as shown in Tab. 3. This is because iSegFormer uses a Swin Transformer as its encoder, while SegFormer uses a vanilla ViT-based encoder.

(3) [R2] Clarify aspects of the training procedure (see R2’s remarks on reproducibility). a. How are the clicks initialized for a given sample? Our click simulation process is: 1) obtain the FP and FN error maps (both binary masks) from the segmentation and the ground truth; 2) transform the two error maps into two distance maps; 3) pick the point with the highest value across the two distance maps as the simulated click, which is positive (negative) if it comes from the FN (FP) error map; 4) encode the click as a small disk in the encoding map. b. What if there are multiple FP or FN regions? This is fine: after transforming to distance maps, we simply choose the point with the highest value. c. Is a positive or negative click generated at the center of each distinct region? Yes. We add small random perturbations to avoid overfitting during training. d. How are samples fed into the network? Your comments are correct. We iteratively simulate clicks for a training batch. We set the maximum number of iterations to N, and each batch is uniformly sampled from 0 to N iterations (N=3 works best).

(4) [R3] Back up claims of significant differences with statistics. This is a valid critique. Though few prior works on interactive segmentation report such statistics (we could be wrong), we will add statistics in the revised version.
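
A minimal sketch of the click-simulation procedure in (3a), assuming binary NumPy arrays for the prediction and ground truth; the disk radius and the omission of the random perturbation from (3c) are simplifications here, not the authors’ exact settings.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def simulate_click(pred, gt):
    """Simulate the next user click from the current prediction.

    pred, gt: binary (H, W) arrays. Returns ((y, x), is_positive).
    """
    fn = np.logical_and(gt == 1, pred == 0)  # missed foreground
    fp = np.logical_and(gt == 0, pred == 1)  # spurious foreground
    # Each error pixel gets its distance to the nearest non-error pixel,
    # so the maximum lies in the interior of the largest error region.
    fn_dist = distance_transform_edt(fn)
    fp_dist = distance_transform_edt(fp)
    if fn_dist.max() >= fp_dist.max():
        y, x = np.unravel_index(fn_dist.argmax(), fn_dist.shape)
        return (y, x), True   # positive click on a false negative
    y, x = np.unravel_index(fp_dist.argmax(), fp_dist.shape)
    return (y, x), False      # negative click on a false positive

def encode_click(shape, center, radius=5):
    """Encode a click as a small binary disk in an (H, W) map."""
    yy, xx = np.ogrid[:shape[0], :shape[1]]
    return ((yy - center[0]) ** 2 + (xx - center[1]) ** 2
            <= radius ** 2).astype(np.float32)
```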

We now address specific reviewer comments.

(5) [R1] The 3D segmentation results are not good enough; no comparison with SOTA. In DeepAtlas (Zhenlin Xu et al., MICCAI’19), the authors obtained 81.2% DSC with 10 labeled images, while we obtained 85.1% with 10 labeled images. This can serve as evidence that our 3D segmentation results are acceptable. We leave improving the 3D segmentation for future work.

(6) [R1, R2] The number of training slices should be 1221, not 1521. Yes, we will correct this typo.

(7) [R2] Shouldn’t the output size be HxW instead of H/4xW/4? We have a non-learnable upsampling operation that is not shown in the figure.

(8) [R2] Suggestions on extending to a larger paper. We appreciate all your great suggestions, which we will definitely consider in the revised version.

(9) [R2] Reasoning for conducting the cross-domain evaluation. Our method is class-agnostic (i.e., trained as a binary segmentation task), so it is interesting to know whether we can transfer knowledge from label-rich domains (natural images) to label-scarce domains (medical images). This also serves as an evaluation of the generalizability of the proposed method.

(10) [R3] Which 2D segmentations were used to start propagation? We used the gold standard for propagation because one can segment an object close to the gold standard with enough clicks. We chose the slices for propagation evenly.

(11) [R3] Why use Dice instead of IoU for 3D segmentation? The two metrics are interchangeable: 1/IoU + 1 = 2/Dice.

(12) [R3] Concerns about clinical utility. For the challenging cartilage segmentation task (labeling one slice may take more than 10 minutes), 20 clicks are still very helpful.

(13) [R3] How good is good enough? Great point. As you mentioned, a perfect segmentation is not necessary due to inter-/intra-observer variability. We do plan a clinical evaluation, through which we will learn how good is good enough for cartilage segmentation.
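
For reference, the identity in (11) follows directly from the set definitions of the two metrics. Writing I = |A ∩ B| and U = |A ∪ B| for prediction A and ground truth B, and using |A| + |B| = U + I:

```latex
\[
\mathrm{Dice} = \frac{2I}{|A| + |B|} = \frac{2I}{U + I}
\qquad\Longrightarrow\qquad
\frac{2}{\mathrm{Dice}} = \frac{U + I}{I} = \frac{1}{\mathrm{IoU}} + 1 .
\]
```

So, for example, the 85% IoU threshold used in the 2D experiments corresponds to a Dice of 2(0.85)/1.85 ≈ 91.9%.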




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed most of the reviewers’ comments. An important point regarding the statistical analysis to back up the claims of the method’s superiority was not fully addressed. As this is an interesting work, I consider that it can be presented at MICCAI. However, it is important that the authors include in the final version the amendments suggested by the reviewers, especially those concerning the statistics backing their claims of superior performance.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All reviewers recommend accepting this paper based on the quality of results and sufficient relative novelty. The rebuttal addresses most reviewer concerns, particularly the lack of technical clarity. The final version should include all reviewer comments and suggestions.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors here propose a very interesting work on interactive segmentation. The field needs more of these efforts and I welcome this contribution. Experiments were good and even included 3D variants. Although I agree with R3 that the direct clinical utility may be suspect if there are too many clicks, another application of this work could be for computer-aided annotation to quickly label large datasets to train downstream fully automatic methods, so I believe that this, and other user-interactive approaches, are useful even when not directly deployed to the clinic.

    However, I did find the authors’ response to the statistical question troubling, as they almost brushed it aside. Certainly, it is best practice to always report spread (as a bare minimum), regardless of what the papers they referenced did. I would much rather the authors had specified how they will concretely address the concern. For this reason, while I do gladly recommend accept, the paper is not ranked as high in my stack as it could easily have been. I encourage the authors to earnestly address what reviewer concerns they can in the final version of their paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5


