
Authors

Shengcong Chen, Changxing Ding, Dacheng Tao, Hao Chen

Abstract

Nucleus segmentation is usually the first step in pathological image analysis tasks. Generalizable nucleus segmentation refers to the problem of training a segmentation model that is robust to domain gaps between the source and target domains. The domain gaps are usually believed to be caused by the varied image acquisition conditions, e.g., different scanners, tissues, or staining protocols. In this paper, we argue that domain gaps can also be caused by different foreground (nucleus)-background ratios, as this ratio significantly affects feature statistics that are critical to normalization layers. We propose a Distribution-Aware Re-Coloring (DARC) model that handles the above challenges from two perspectives. First, we introduce a re-coloring method that relieves dramatic image color variations between different domains. Second, we propose a new instance normalization method that is robust to the variation in foreground-background ratios. We evaluate the proposed methods on two H&E stained image datasets, named CoNSeP and CPM17, and two IHC stained image datasets, called DeepLIIF and BC-DeepLIIF. Extensive experimental results justify the effectiveness of our proposed DARC model. Codes are available at https://github.com/csccsccsccsc/DARC.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43987-2_57

SharedIt: https://rdcu.be/dnwKc

Link to the code repository

https://github.com/csccsccsccsc/DARC

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a new re-coloring method (RC) based on the Sort-Match algorithm, and a novel distribution-aware instance normalization method (DAIN). The Sort-Match based re-coloring algorithm is related to two previous works (Zhang et al. CVPR 2022[41] and Rolland et al. JEI 2000[42]), but it adapts well to re-coloring histopathology images. The proposed DAIN module is related to instance normalization and adaptive instance normalization, but it makes use of the estimated foreground-background ratio. Importantly, in Table 2, the two proposed modules and their combination are compared with existing stain and instance normalization methods, and show their advantages.
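
For readers unfamiliar with sort-matching, the generic idea can be sketched as below. This is a minimal illustration of rank-based value matching in general, not the paper's Alg. 1: the function name `sort_match`, the flat-list inputs, and the equal-length requirement are all assumptions made for the example.

```python
def sort_match(source, reference):
    """Give `source` the value distribution of `reference` while keeping
    the rank order of `source`. Both inputs are flat lists of equal length.
    Hypothetical helper for illustration; not taken from the DARC code."""
    if len(source) != len(reference):
        raise ValueError("source and reference must have the same length")
    # Sort step: indices of the source values from smallest to largest.
    order = sorted(range(len(source)), key=lambda i: source[i])
    ref_sorted = sorted(reference)
    # Assign step: the k-th smallest source element receives the
    # k-th smallest reference value.
    out = [0] * len(source)
    for rank, idx in enumerate(order):
        out[idx] = ref_sorted[rank]
    return out


# Toy intensities, invented for illustration.
recolored = sort_match([30, 10, 20], [5, 7, 9])
```

Because only ranks of the source are used, the spatial ordering of bright and dark pixels is kept while the value range follows the reference.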

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors have done extensive experimental comparisons. The proposed re-coloring (RC) method is compared with existing stain normalization methods. The proposed distribution-aware instance normalization (DAIN) is also compared with existing normalization tricks. The effectiveness of the combination of RC and DAIN is also studied.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Are there any key hyper-parameters in these two proposed techniques? How to set these hyper-parameters? What is the effect of different setting of hyper-parameters?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim in the abstract that they will release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It is suggested to release the code and model weights after acceptance.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose two novel techniques and have shown their effectiveness well.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I disagree with the 1st weakness proposed by Reviewer #2.

    Weakness1 from Reviewer #2: The hypothesis is not convincing: I am not convinced that different ratio between foreground (nucleus) and background pixel numbers should be considered as a part of domain gap, because intra ratio variation is very common across different categories. For example, different tumor grades in thyroid cytopathology by nature have different ratios, this variation is actually considered as a core factor for downstream analysis. A1: 1) The foreground-background ratio may not be a domain gap between two datasets. However, in my opinion, the ratio can be a domain gap between two small batches. In the manuscript, the batch size is quite small and set to 4. The following cases could happen. When sampling the i-th batch, maybe 3 of 4 samples have a high foreground-background ratio. When sampling the (i+1)-th batch, maybe none of the 4 samples has a high ratio. These cases could lead to variations in the loss and gradients, and affect training stability. 2) Besides, according to my own experimental experience, a sample with more nuclei usually leads to more large values in the feature maps output by CNNs. Thus, two samples of low and high foreground-background ratio could have different amplitudes in their loss values and gradients. That could be the motivation of the proposed DAIN.

    Weakness2 from Reviewer #2: Generalization ability to different network architecture is poor: A very core step in the success is replacing batch normalization by instance normalization. However, such adaptation is confined to CNN-based network. What if the baseline is a more prevalent transformer-based network, where there is no BN (as they have already shown their power in the context of nuclei segmentation)? A2: CNNs and Transformers are two large groups of models. Improving CNNs is significant enough for a conference paper. Further verifying the effectiveness with transformers can be future work.



Review #2

  • Please describe the contribution of the paper
    1. Argue that domain gaps in terms of nuclei segmentation task can be caused by different foreground (nucleus)- background ratios.
    2. Introduce a re-coloring method and a new instance normalization method to address the color variations and different foreground-background ratio, respectively.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors attempt to explore more factors beyond color variation that can contribute to a domain gap in the context of nuclei segmentation task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The hypothesis is not convincing: I am not convinced that different ratio between foreground (nucleus) and background pixel numbers should be considered as a part of domain gap, because intra ratio variation is very common across different categories. For example, different tumor grades in thyroid cytopathology by nature have different ratios, this variation is actually considered as a core factor for downstream analysis.
    2. Generalization ability to different network architecture is poor: A very core step in the success is replacing batch normalization by instance normalization. However, such adaptation is confined to CNN-based network. What if the baseline is a more prevalent transformer-based network, where there is no BN (as they have already shown their power in the context of nuclei segmentation)?
    3. Method section is not self-contained: Sort, AssignValue operations in Alg.1 are not explained.
    4. The authors synthesize a task to verify their concept by padding the original testing images with background patches. The performance decline might be attributed to the change of magnification (Here, I am supposing that the authors resized the image after padding, otherwise more regions of easy-to-segment background should increase the metrics rather than bring them down. Did I miss something?)
    5. Readability of Sec.2.3 should be improved: The authors should explain the design, motivation, and each step in Alg.2 to improve the readers’ understanding of the method.
    6. Is \rho numerical? Why use binary cross entropy to train it?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claimed they will release the code after acceptance, which might contribute to the reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Discussion on extension to transformer or other variants of UNet.
    2. Method section needs to be self-contained.
    3. The authors should explain the design, motivation, and each step in Alg.2 to improve the readers’ understanding of the method.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The fundamental hypothesis that different ratios between foreground (nucleus) and background pixel numbers should be considered as a part of the domain gap is not clinically reasonable (weakness 1 in Q6). The corresponding experiments cannot support their claim (weakness 4 in Q6).

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision

    Thanks for the clarification.

    • Regarding the relation between domain gap & foreground-background ratio, the authors acknowledged that existing databases are quite small, thus it seems like the claim might fail to fit most real-world settings (by nature those datasets are small in scale).
    • Moreover, although the statistical test shows a significant difference between the ratios of CoNSeP and DeepLIIF, I believe it’s difficult to regard this significance as a proof of the domain gap.
    • Other factors, such as data imbalance (for example, the training set having a higher proportion of tissue structures that take up a higher proportion of foreground), can result in the same statistical significance as well as the performance decline.
    • Finally, as the model has more parameters (Tab.3), it’s difficult to tell if the improvement is caused by the methodology itself rather than those extra parameters.



Review #3

  • Please describe the contribution of the paper

    The paper studies a generalizable nucleus segmentation task, which adapts a model trained on a single domain to other domains. The authors observe that domain gaps can also be caused by different foreground (nucleus)-background ratios, as this ratio significantly affects feature statistics that are critical to normalization layers. To address this, a re-coloring method that relieves dramatic image color variations between different domains and a new instance normalization method that is robust to the variation in foreground-background ratios are proposed. Experiments are conducted on 4 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is the first to raise an issue overlooked by previous studies in DA nucleus segmentation, namely the foreground-background ratio. It presents a distribution-aware instance normalization strategy to address this problem.
    2. To address the color difference between domains, the paper presents a recoloring strategy, which is simple and effective.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The analysis experiment in Tab.1 is not clear. If you use padding to enlarge the background ratio, the image will be resized to a fixed size and fed to the network during training. Therefore, each instance will be smaller. I assume that the test data is fixed for all experiments (or the comparison will be meaningless), so the size ratio between instances in the test and training set will change as well. Therefore, could this performance gap be caused by the changed size of instances, rather than the number of foreground pixels? I think it is quite easy to validate: you may add more instances to your padded image to have the same number of foreground pixels, to see if the performance still drops, or you could remove some foreground instances without changing the instance size distribution in the training set, to support your hypothesis.
    • As more parameters are used in DARC, what if we change the segmentation model to a larger one and use other methods that require no extra parameters?
    • It is recommended to mark the highest numbers with bold or color in your Tab 2, and to rank the methods in order of increasing performance, so that the comparison is easier for readers to follow.
    • The performance of RC+IN is missing.
    • Visualization of the recolored image Ir is recommended, to support the statement that the semantic information or fine-grained textures are preserved.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible based on the paper description.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    NA

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is simple and effective. More in-depth analysis is required.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This manuscript introduces a domain generalization method for nucleus instance segmentation, by incorporating an image re-coloring module into a U-Net-like network and using distribution-aware instance normalization layers. The proposed method is compared with many other related approaches, and the experiments show that the method can provide superior performance.

    However, the reviewers raised several significant concerns as follows:

    1. Clarify if there are any hyper-parameters for the proposed method (R1). If so, explain how to select proper values?

    2. The hypothesis or assumption of the method, i.e., different ratios between foreground (nucleus) and background pixel numbers should be considered as a part of domain gap, is not clinically reasonable and not convincing (R2).

    3. The experiments to verify the hypothesis are questionable: the model performance degradation in Table 1 might be attributed to the change of magnification (R2 and R3).

    4. The method may have poor generalization ability to different network architectures, such as transformers (R2). R3 also asked for comments on the applicability of the method to other network architectures.

    5. The description of the method, including Algorithms 1 and 2, is not self-contained and needs a significant improvement (R2).

    6. Clarify if \rho is numerical and clearly explain how to apply it to a binary cross entropy loss for model training (R2).

    7. Presentation of the experimental results, e.g., Table 2, also needs an improvement (R3).

    Please consider addressing the comments above in the rebuttal.




Author Feedback

To R1: Q1: Key hyper-parameters. A1: There is only one hyper-parameter, i.e., the momentum factor in DAIN. We set it to 0.01 empirically. We will add an ablation study on its value.

To R2: Q1: Relation between domain gap & foreground-background ratio. A1: The key point is that existing databases are quite small. In practice, they cannot cover a large range of ratios at all and their ratios can be very different from those in testing. For example, the ratio statistics in CoNSeP and DeepLIIF are 0.16±0.12 and 0.26±0.08, respectively. The p-value (t-test) between these two statistics is 4.24e-9, which is far smaller than 0.01 and means their ratios are statistically different. The difference in ratios affects model performance, as discussed in Sec. 2.3. Therefore, it is a key but often neglected factor causing domain gaps.
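
The kind of comparison described in A1 can be sketched as follows. This is a minimal illustration under stated assumptions: the ratio is taken to be the fraction of foreground pixels in a binary mask (consistent with values in [0, 1]), Welch's variant of the t-test is assumed because the rebuttal does not say which variant was used, and the per-image ratios are invented toy numbers, not values from CoNSeP or DeepLIIF. The helper names are hypothetical.

```python
import math
import statistics


def foreground_ratio(mask):
    """Fraction of foreground pixels in a binary mask (list of rows of 0/1).
    Assumed definition; the paper may define the ratio differently."""
    total = sum(len(row) for row in mask)
    foreground = sum(sum(row) for row in mask)
    return foreground / total


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Toy per-image ratios, invented for illustration only.
ratios_a = [0.10, 0.20, 0.15, 0.05]
ratios_b = [0.30, 0.25, 0.20, 0.35]
t_stat, dof = welch_t(ratios_a, ratios_b)
```

Turning the statistic into a p-value needs the CDF of Student's t distribution, which is not in the Python standard library; a statistics package would normally be used for that step.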

Q2: Application to transformers. A2: We greatly appreciate this suggestion. The LN in transformers performs normalization across channels for each individual pixel alone. This is harmful for segmentation as it reduces the difference between foreground/background pixels. Please refer to Section III.B (We believe that…after LayerNorm) of the ‘Vision Transformers for Single Image Dehazing’ paper, TIP 2023. To justify our claim, we test the UCTransNet model (Wang et al., AAAI’22) with LN and our DAIN, respectively. We train the models on CoNSeP and evaluate them on other databases. The AJI/Dice values are 0.30/0.45 (LN) and 0.34/0.51 (DAIN), respectively. The success on transformers significantly enhances the value of DAIN.
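
The point about LN in A2 can be illustrated with a toy case. The sketch below compares per-pixel normalization across channels (as in LayerNorm, without affine terms) with per-channel normalization across pixels (as in plain instance normalization, not DAIN). The feature values are invented, and the foreground pixel is deliberately chosen to differ from the background pixels only in amplitude, which is the situation where the effect is most visible.

```python
import math


def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm(pixels):
    """Normalize each pixel's channel vector on its own (no affine terms)."""
    return [normalize(p) for p in pixels]


def instance_norm(pixels):
    """Normalize each channel over all pixels of one image (no affine terms)."""
    per_channel = [normalize(list(ch)) for ch in zip(*pixels)]
    return [list(px) for px in zip(*per_channel)]


# One image as a list of per-pixel channel vectors (toy values):
# two "background" pixels and one "foreground" pixel with 10x amplitude.
pixels = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
ln_out = layer_norm(pixels)
in_out = instance_norm(pixels)

# Largest per-channel difference between background pixel 0 and foreground pixel 2.
ln_gap = max(abs(a - b) for a, b in zip(ln_out[0], ln_out[2]))
in_gap = max(abs(a - b) for a, b in zip(in_out[0], in_out[2]))
```

In this toy case `ln_gap` is close to zero while `in_gap` stays large, because per-pixel normalization removes the amplitude difference that separates the two kinds of pixels.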

Q4: The experiment in Table 1. A4: There are two things to clarify. First, there is no resizing after padding. Via padding with more background pixels, the foreground-background ratio naturally changes in testing. Second, in inference, the easy-to-segment background does not increase the metrics. We feed the obtained larger image into a trained model. We only count the predictions within the area of the original image, i.e., the padded background area does not contribute to the metrics. More details on changing the ratio: there are 3 steps. First, we remove nuclei in an image via an inpainting algorithm. Second, we enlarge the obtained background image via reflection padding. Third, we replace the central region of the enlarged image with the original image. The obtained new image has the same nuclei pixels as those in the original image, but the ratio is changed because it has more background pixels.
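
The second and third steps, plus the restriction of the metrics to the original area, can be sketched as below. This is a minimal sketch under assumptions: images are single-channel 2D lists, the inpainted background from step one is passed in as `background` (the inpainting algorithm itself is not shown), and all function names are hypothetical.

```python
def reflect_index(i, n):
    """Map an out-of-range index into [0, n) by mirroring at the borders
    (the border element itself is not repeated)."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return period - i if i >= n else i


def build_padded_test_image(original, background, pad):
    """Reflection-pad the nuclei-free background by `pad` pixels on every
    side, then paste the original image back into the central region."""
    h, w = len(original), len(original[0])
    padded = [
        [background[reflect_index(r - pad, h)][reflect_index(c - pad, w)]
         for c in range(w + 2 * pad)]
        for r in range(h + 2 * pad)
    ]
    for r in range(h):
        padded[r + pad][pad:pad + w] = original[r]
    return padded


def crop_center(prediction, h, w, pad):
    """Keep only the area of the original image, so the padded background
    does not contribute to the metrics."""
    return [row[pad:pad + w] for row in prediction[pad:pad + h]]


# Toy 2x2 example with invented values.
original = [[9, 9], [9, 9]]
background = [[1, 2], [3, 4]]
padded = build_padded_test_image(original, background, pad=1)
```

The nuclei pixels are unchanged and no resizing takes place; only the number of background pixels around them grows, which lowers the foreground fraction seen by the network.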

Q6: Clarify if \rho is numerical. A6: \rho denotes the foreground-background ratios. It is a vector with two elements for two prediction types: instance contour and segmentation map. The ratios are in the range of [0, 1]; therefore, we adopt the BCE loss.
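
Why BCE is usable with a real-valued target in [0, 1] can be seen from a small sketch. The code below is a generic binary cross entropy with soft targets, not the authors' loss implementation; the clamping constant `eps`, the averaging over the two elements, and the toy ratio values are assumptions.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross entropy for one prediction; `target` may be any value
    in [0, 1], not only 0 or 1."""
    pred = min(max(pred, eps), 1 - eps)  # avoid log(0)
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


def bce_vector(preds, targets):
    """Mean BCE over a vector, e.g. one estimated ratio per prediction type."""
    return sum(bce(p, t) for p, t in zip(preds, targets)) / len(preds)


# Toy target ratios for two prediction types, invented for illustration.
target_rho = [0.3, 0.1]
loss_exact = bce_vector([0.3, 0.1], target_rho)
loss_off = bce_vector([0.6, 0.4], target_rho)
```

For a fixed soft target the loss is smallest when the prediction equals the target, although that minimum is the entropy of the target rather than zero.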

Finally, we promise to try our best to improve the readability of the paper.

To R3: Q1&A1: Please refer to R2.Q4&A4. Besides, we do not pad images during training. The training data is fixed and the testing data requires padding.

Q2: Performance of a larger model with parameter-free DG methods. A2: Table 3 shows the size of DARC is 5.47M. We enlarge the baseline model to 5.6M and 11.5M, respectively, by adding more residual blocks to its encoder. We also apply the parameter-free EFDMix [41] method to the enlarged baseline models. For the former model, the AJI/DICE scores are 0.35/0.58. For the latter model, the AJI/DICE scores are 0.34/0.56. According to Table 2, DARC still outperforms these models.

Reply to Q3&Q5: We will revise the paper according to your advice.

Q4: Performance of RC+IN. A4: In Table 2, RC* denotes the model of RC+IN.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This manuscript introduces a generalizable nucleus segmentation method using image re-coloring and distribution-aware instance normalization. This submission received divergent ratings/comments, and the rebuttal has addressed some concerns from the reviewers, such as the applicability of the proposed method to other network architectures and the experimental details for Table 1. In addition, the experiments demonstrate the effectiveness of the proposed method. Although the clarity of the motivation is not fully addressed (e.g., why the foreground-background ratio can be considered a domain gap between two datasets), the strengths outweigh the weaknesses. The authors are encouraged to improve the clarity in the revised version.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors propose a network adaptation/normalization to better deal with different foreground background ratios. The paper had quite divergent reviews after the initial review period, and is also still seen divergently after the rebuttal. The main strengths of the paper include looking at the problem of background/foreground distribution in general as well as the extensive experiments conducted by the authors. Weaknesses of the paper in the initial reviews include the support of the underlying hypothesis and the corresponding empirical evidence, the transferability of the approach to other NN architectures. In general, it is fairly well known that batch normalization can cause issues, especially for small batch sizes - since the batch size is only 4 in this paper, a considerable performance boost could be expected. Still, additional gains are shown for the remaining adaptations of the method. Of note, in their rebuttal, the authors provide additional experimental results to support their claim. This is not ideal as the MICCAI review-rebuttal process does not really allow for a detailed assessment in the context of the paper.

    I agree with R#2 to some extent in that the paper is not watertight with regards to the impact of the foreground/background setting and the additional parameters introduced by DAIN may confound the results; here additional and more thorough experimental investigations would have benefitted the paper. Still, I believe the proposed approach provides evidence that there is some added benefit in the method and thus may form an interesting paper for discussion at MICCAI, being slightly above the acceptance threshold.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a new nucleus instance segmentation method that incorporates an image re-coloring module into a U-Net-like network and uses distribution-aware instance normalization layers, so it is particularly useful when dealing with varied ratios of foreground and background pixel numbers. Reviewers have quite diverse or somewhat conflicting opinions about the paper: whilst R1 endorses the paper, R2 questions the fundamental assumption of the method, namely whether the nucleus density can be considered a domain gap, as well as the generalizability of the method. The author rebuttal did not fully convince R2. In my opinion, this is a niche problem, and the proposed distribution-aware instance normalization indeed helps when databases are so small that they cannot cover a large range of ratios. In my opinion, this should first be solved by getting more, and more diverse, training data rather than developing a special method for the problem, unless there is something special that leads to limited training data (e.g., a particularly fancy microscopy technique, or a rare disease, etc.). To me, there are no particular obstacles to getting more diverse training data for nuclei segmentation. As reviewer R2 said, there is always diversity in training data regarding tissue structure, and one cannot model it specifically during network training. This makes me question the practical usage of such a method, and I therefore suggest rejecting the paper.


