
Authors

Zehui Liao, Shishuai Hu, Yutong Xie, Yong Xia

Abstract

Manual medical image segmentation is subjective and suffers from annotator-related bias, which can be mimicked or amplified by deep learning methods. Recently, researchers have suggested that such bias is the combination of annotator preference and stochastic error, modeled by convolution blocks placed after the decoder and a pixel-wise independent Gaussian distribution, respectively. However, convolution blocks are unlikely to model the varying degrees of preference effectively at full resolution, and the pixel-wise independent Gaussian distribution disregards pixel correlations, leading to discontinuous boundaries. This paper proposes a Transformer-based Annotation Bias-aware (TAB) medical image segmentation model, which tackles annotator-related bias by modeling annotator preference and stochastic errors. TAB employs a Transformer with learnable queries to extract preference-focused features, enabling it to produce segmentations with various preferences simultaneously using a single segmentation head. Moreover, TAB adopts a multivariate normal distribution assumption that models pixel correlations, and learns the annotation distribution to disentangle the stochastic error. We evaluated TAB on an OD/OC segmentation benchmark annotated by six annotators. Our results suggest that TAB outperforms existing medical image segmentation models that take annotator-related bias into account.
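As a rough illustration of the two ideas the abstract names — learnable preference queries and a low-rank multivariate normal over pixel logits — consider the following PyTorch sketch. All shapes, module choices, and the use of torch.distributions.LowRankMultivariateNormal are illustrative assumptions, not the authors' implementation (see the code link below).

    import torch
    import torch.nn as nn
    from torch.distributions import LowRankMultivariateNormal

    # Hypothetical sizes: R annotators, d-dim tokens, N flattened pixels, rank k.
    R, d, N, k = 6, 256, 64 * 64, 10

    # Learnable preference queries: one query per annotator, attended over
    # encoder features so each query gathers preference-focused features.
    queries = nn.Parameter(torch.randn(R, d))
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
    feats = torch.randn(1, N, d)  # stand-in for encoder/decoder features
    pref_feats, _ = attn(queries.unsqueeze(0), feats, feats)  # (1, R, d)

    # Low-rank multivariate normal over flattened logits: a full N x N
    # covariance is intractable, so it is factored as
    # cov_factor @ cov_factor.T + diag(cov_diag).
    mean = torch.randn(N)           # predicted per-pixel logit means
    cov_factor = torch.randn(N, k)  # low-rank factor capturing pixel correlations
    cov_diag = torch.rand(N) + 1e-3
    dist = LowRankMultivariateNormal(mean, cov_factor, cov_diag)
    sample_logits = dist.rsample()  # one spatially correlated logit sample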

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43901-8_3

SharedIt: https://rdcu.be/dnwCD

Link to the code repository

https://github.com/Merrical/TAB

Link to the dataset(s)

https://deepblue.lib.umich.edu/data/concern/data_sets/3b591905z


Reviews

Review #3

  • Please describe the contribution of the paper

    – Proposes a novel medical image segmentation framework for multi-annotated datasets based on the DETR structure, which can simultaneously learn the bias of each annotator and estimate the meta annotation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – The paper is well written and easy to follow.

    – The motivation is clear, and the techniques selected to implement the idea are sound. Using the query tokens of DETR to learn the bias of each annotator is reasonable.

    – The experimental settings are convincing, and the implementation details are clearly presented. Cross-validation is performed on two publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – The quantitative improvement is marginal: the Dice scores of the proposed method improve by only about 1% over the second-best methods.

    – The method may have higher computational complexity. The calculations derived from the multivariate normal distribution may become the computational bottleneck of the entire framework, even though the authors apply low-rank techniques to reduce the computation.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is attached as supplementary material. The datasets are publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    – As shown in Table 2, the independent Gaussian distribution also achieves competitive results while requiring much less computation than the multivariate normal distribution. I therefore wonder whether the independent Gaussian distribution is good enough here. The authors should provide computational-complexity metrics, e.g., FLOPs, GPU memory cost, and number of model parameters, for the models in this table.

    – The datasets used for the experiments may be too simple, so the results of the compared methods are all very similar (Dice scores above 0.96); this is why the improvement of the proposed method over the others looks marginal. I suggest the authors conduct experiments on other, more challenging medical segmentation datasets to demonstrate the consistent advantages of the proposed method.

    – Does one specific query token always correspond to the annotations from the same annotator? This should be clearly stated.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    – Good idea and reasonable modeling.

    – The results are a bit weak.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents an image segmentation framework that models annotators' preferences and the stochastic errors in training labels. The method can learn a segmentation task using labels from multiple annotators and, at inference, produce a meta segmentation as well as annotator-specific segmentation maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The tackled problem for model training using information of multiple annotators is an interesting and specifically relevant task in medical image segmentation settings.

    • The method is technically sound and the choice of integrated components is clear.

    • The comparison with relevant methods and the provided ablation study is adequate.

    • A concise yet inclusive literature review of the relevant methods is provided.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • My main concern is around the significance of the obtained improvements. The increase in provided metrics against compared methods in Table 1 seems to be fairly marginal.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is highly reproducible, and the training details are provided. The code of the implemented networks is also provided in the supplementary materials.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The tackled problem is an important and relevant problem in various medical segmentation tasks. The method includes an interesting utilization of transformers for this task, such as the proposed application of queries to represent annotators’ preferences.

    • My main concern is the observed marginal improvement on the experimental data. Could the authors provide a power analysis to show that the obtained improvements are statistically significant?

    • The comparison of model size and run time could further add value to the paper.

    • The method seems to be general, with no specific ties to fundus image segmentation. The authors could add other medical datasets (US, CT, MR, etc.) to the experiments to further support the paper's claims. This suggestion could be addressed in the current paper (if rebuttal time and paper space limits allow) or in future extensions of the work.

    Minor comment: the use of the “/” character in the last paragraph of Section 2.4 (page 5) may be confusing, as the character could denote either “or” or the division operator.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology is an interesting adaptation of transformers and stochastic prediction to incorporate multi-annotator information and assess predictive uncertainty. My main concern is the unclear significance of the obtained results (marginal improvements); hence my recommendation is weak accept.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The proposed transformer based model can produce annotator specific segmentation and fused segmentation outputs in a single model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is novel to a certain extent in that it uses a transformer to learn the distribution of segmentations from multiple annotators.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Given only 95 test images, the performance improvement over other methods is not significant, and no statistical tests are provided to support the findings.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The network structure is quite complex, which makes it difficult to reimplement from scratch and to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The main concern is that the performance differences between the methods are quite small, and no statistical test is provided to evaluate their significance.

    • Given the proposed network has a complex structure, it would be desirable to report the memory consumption and number of parameters required.

    • It is not obvious from Figure 2 that the proposed method is favored over the other methods. This requires further explanation.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method has a certain novelty that may be interesting to the community.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The problem of accommodating multiple annotators in medical image segmentation is interesting and relevant. The paper is well written; the motivation and methodology are clear and the results are convincing. However, the dataset used is of limited size and the increase in performance is limited; the implications of the results for real-world applications should be discussed. The computational complexity and memory consumption should be discussed. The specific relevance of the approach to fundus imaging should be discussed in more detail, as well as generalisability to other modalities.




Author Feedback

We sincerely thank all reviewers and ACs for their recognition of the novelty and clinical significance of this work. Here are responses to their invaluable suggestions and remaining concerns.

Q1. Significance test of comparison results (AC, R1, R2, R3)

The performance gain of our TAB on the optic disc segmentation task is marginal. We attribute this to the fact that the boundary of the optic disc is clear and the task is relatively simple, so performance on it tends to saturate. However, we would like to highlight that our TAB performs significantly better than the best competitor on the more challenging optic cup segmentation task, which has an ambiguous boundary (p-value < 0.05). As suggested, we will validate our TAB and the competing methods on other challenging datasets in a future extension of this work. The performance of our TAB versus the best competitor is as follows.

Average (Disc): 96.70 vs. 96.54; p-value = 1.0
Average (Cup): 85.52 vs. 84.32; p-value = 1.6e-5 < 0.05
Mean Voting (Disc): 97.82 vs. 97.86; p-value = 0.2
Mean Voting (Cup): 88.22 vs. 87.77; p-value = 0.002 < 0.05
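The rebuttal does not state which paired test produced these p-values; one common choice for per-image Dice comparisons is the Wilcoxon signed-rank test. A minimal sketch, assuming per-image Dice scores on the 95 test images are available (the arrays below are random stand-ins, not real results):

    import numpy as np
    from scipy.stats import wilcoxon

    # Hypothetical per-image Dice scores on the 95 test images.
    rng = np.random.default_rng(0)
    dice_tab = rng.normal(loc=0.8552, scale=0.03, size=95)   # stand-in for TAB
    dice_best = rng.normal(loc=0.8432, scale=0.03, size=95)  # stand-in competitor

    # Paired, non-parametric test on the per-image score differences.
    stat, p_value = wilcoxon(dice_tab, dice_best)
    print(f"Wilcoxon statistic={stat:.1f}, p-value={p_value:.3g}")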

Q2. Validation on other larger datasets that are different modalities (AC, R2, R3)

Although we have only validated our TAB on the optic disc/cup segmentation dataset, TAB has no design specific to fundus images. We will demonstrate the effectiveness of our TAB on medical datasets of other modalities in an extension of this work. Moreover, the dataset currently used is limited in size, and we will select a larger dataset to validate our TAB in that extension.

Q3. Implications of the results to real-world application (AC)

The success of deep neural networks relies heavily on accurately labeled training data, which are often unavailable for medical image segmentation tasks, since manual annotation is highly subjective and depends on the observer’s perception, expertise, and concentration. TAB is designed to address the annotator-related bias that exists in medical image segmentation. It predicts an accurate meta segmentation and annotator-specific segmentations in parallel. Therefore, TAB can provide specific segmentation maps for the doctors who labeled the training set and generic segmentation maps for other doctors, both for reference. Moreover, an accurate segmentation map can facilitate the subsequent diagnosis.

Q4. GPU memory cost and the number of parameters (R1, R2, R3)

Compared to the baseline model (i.e., M_r), which contains a CNN encoder and a CNN decoder, our TAB introduces a PFE module containing four multi-head attention modules and two feed-forward networks, and constructs an SS head that adds three convolutional layers after the CNN decoder. Consequently, the M_r model has 22.01069 million parameters, and TAB has 23.25021 million parameters. The parameter counts and GPU memory costs of all competing methods are as follows.

M_r: Params = 22.01069 M; GPU cost = 2339 MiB
MH-UNet: Params = 22.04190 M; GPU cost = 2575 MiB
MV-UNet: Params = 22.01069 M; GPU cost = 2339 MiB
MR-Net: Params = 81.18959 M; GPU cost = 14359 MiB
CM-Net: Params = 22.26511 M; GPU cost = 7521 MiB
PADL: Params = 22.04395 M; GPU cost = 3345 MiB
AVAP: Params = 22.06952 M; GPU cost = 4097 MiB
Ours: Params = 23.25021 M; GPU cost = 7006 MiB
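The rebuttal does not say how these figures were measured; below is a minimal PyTorch sketch of one plausible way to obtain comparable numbers. The helper names and input shape are assumptions, and training-time memory (with gradients and optimizer state) would be higher than this forward-pass estimate.

    import torch

    def count_params_millions(model: torch.nn.Module) -> float:
        # Total trainable parameters, in millions.
        return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

    def peak_gpu_mib(model: torch.nn.Module, input_shape=(1, 3, 256, 256)) -> float:
        # Peak CUDA memory (MiB) allocated during one forward pass.
        torch.cuda.reset_peak_memory_stats()
        model = model.cuda().eval()
        with torch.no_grad():
            model(torch.randn(*input_shape, device="cuda"))
        return torch.cuda.max_memory_allocated() / 2**20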

Q5. Explanation of Figure 2 (R1)

Compared to the recent competing methods, i.e., PADL and AVAP, our TAB produces smoother segmentation boundaries when the object boundary is fuzzy, as for the optic cup.

Q6. About preference query (R3)

Given an input image, the r-th predicted annotator-specific segmentation map is generated from the r-th preference query and supervised by the annotation provided by the r-th annotator. We will revise the manuscript to clarify this.
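In other words, the pairing between queries and annotators is fixed by the loss rather than learned by matching. A minimal sketch of this fixed pairing, with all tensor shapes and the binary cross-entropy loss as illustrative assumptions:

    import torch
    import torch.nn.functional as F

    # Hypothetical shapes: batch B, R annotators, H x W masks.
    B, R, H, W = 2, 6, 64, 64
    pred_logits = torch.randn(B, R, H, W)  # r-th map from r-th preference query
    annotations = torch.randint(0, 2, (B, R, H, W)).float()  # r-th annotator's mask

    # Fixed pairing: the r-th prediction is supervised only by the r-th annotator.
    loss = sum(
        F.binary_cross_entropy_with_logits(pred_logits[:, r], annotations[:, r])
        for r in range(R)
    ) / R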


