
Authors

Chundan Xu, Ziqi Wen, Zhiwen Liu, Chuyang Ye

Abstract

Automated cell detection in histopathology images can provide a valuable tool for cancer diagnosis and prognosis, and cell detectors based on deep learning have achieved promising detection performance. However, the stain color variation of histopathology images acquired at different sites can deteriorate the performance of cell detection, where a cell detector trained on a source dataset may not perform well on a different target dataset. Existing methods that address this domain generalization problem perform stain normalization or augmentation during network training. However, such stain transformation performed during network training may still not be optimally representative of the test images from the target domain. Therefore, in this work, given a cell detector that may be trained with or without consideration of domain generalization, we seek to improve domain generalization for cell detection in histopathology images via test-time stain augmentation. Specifically, a histopathology image can be decomposed into the stain color matrix and stain density map, and we transform the test images by mixing their stain color with that of the source domain, so that the mixed images may better resemble the source images or their stain-transformed versions used for training. Since it is difficult to determine the optimal amount of the mixing, we choose to generate a number of transformed test images where the stain color mixing varies. The generated images are fed into the given detector, and the outputs are fused with a robust strategy that suppresses improper stain color mixing. The proposed method was validated on a publicly available dataset that comprises histopathology images acquired at different sites, and the results show that our method can effectively improve the generalization of cell detectors to new domains.
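
For concreteness, the following is a minimal sketch of the test-time stain augmentation loop described above. It is an illustration rather than the paper's implementation: it assumes the standard Beer-Lambert relation I = I0 · exp(-WH) between an RGB image, a 3x2 stain color matrix W, and a 2xN stain density map H, assumes the mixing is a convex combination of the test and source stain color matrices, and treats `detector` as a placeholder for any trained cell detector that returns boxes with confidence scores.

```python
# Minimal, hypothetical sketch of test-time stain augmentation (not the authors' code).
import numpy as np

def reconstruct_rgb(W, H, shape, i0=255.0):
    """Rebuild an RGB image from a 3x2 stain color matrix W and a 2xN density map H."""
    od = W @ H                              # optical density per pixel (Beer-Lambert)
    rgb = i0 * np.exp(-od)                  # back to intensity space
    return np.clip(rgb.T.reshape(shape), 0, 255).astype(np.uint8)

def test_time_stain_augmentation(test_img, W_t, H_t, W_s, detector,
                                 alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Generate stain-mixed versions of the test image and collect detector outputs.

    W_t, H_t: stain color matrix and density map of the test image;
    W_s: a representative source-domain stain color matrix;
    detector: any trained cell detector returning a list of (box, confidence) pairs.
    """
    per_alpha_outputs = []
    for alpha in alphas:
        # alpha = 1 keeps the test image's own stain; alpha = 0 transfers it to the source stain.
        W_mix = alpha * W_t + (1.0 - alpha) * W_s
        mixed = reconstruct_rgb(W_mix, H_t, test_img.shape)
        per_alpha_outputs.append(detector(mixed))
    return per_alpha_outputs  # to be fused by the paper's confidence-based strategy
```

A possible (equally hypothetical) fusion of these per-alpha outputs is sketched later, under the detailed comments of Review #4.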

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_15

SharedIt: https://rdcu.be/cVRrm

Link to the code repository

N/A

Link to the dataset(s)

https://tupac.grand-challenge.org/Dataset/


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a test-time stain augmentation (TTSA) method for cell detection in histopathology images, which transforms the test image based on stain mix-up; the detection results from the different augmented images are then fused to produce the final output. The main contribution of this paper is the fusion part, since an existing method (stain mix-up) [4] is used for the stain augmentation and the fusion is mostly based on the existing test-time ensemble method [3]. Therefore, the contribution is minor.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    *The test-time augmentation and the fusion of its results are simple and easy to implement.

    *In the experimental results, the proposed method improved the detection performance of several baseline methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The technical contribution is minor. Half of the methodological section explains the existing work (stain mix-up) [4]. The difference from the existing work is the use of stain mix-up at test time and the fusion of the results. However, it is unclear why this mix-up augmentation is good for test images without any training (the reviewer understands the effectiveness of stain mix-up during training; the rationale is the same as for mix-up augmentation in semi-supervised general object classification). At test time, it is not guaranteed that the test image is properly transferred to the color distribution of the source images. Please clarify this.

    • In the experimental results in Table 2, α=0 is the best in most cases (this is related to the comment above). The reviewer considers that the performance improvement mainly comes from the ensemble of several detection results, and it is unclear whether the mix-up augmentation is suitable for an ensemble for cell detection. Even with other types of experts (different augmentations, networks, or data sampling), the performance might improve similarly. To show the effectiveness of the combination of stain mix-up and the test-time ensemble approach, it would be better to compare with other ensemble methods (there are many test-time augmentation methods).

    *In addition, the existing test-time ensemble method for object detection [3] is used to fuse the results.

    *There is no discussion of existing test-time ensemble methods as related work.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed method is largely built on existing code, and the parts that differ are described; thus, the reproducibility is fine.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    *This ensemble method may work well when each detection result tends toward underestimation (there are false negatives but few false positives). However, when each detection result tends toward overestimation (contains false positives), the final result will also contain false positives, so the ensemble does not always improve the detection performance. How can this be controlled?

    *TTSA is defined in the caption of Fig. 1. An abbreviation is usually defined in the main text where it first appears.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As described in the weaknesses, the idea of test-time augmentation is not novel (the method is mostly the same), and the effectiveness of the combination of stain mix-up and the test-time ensemble was not well evaluated. Therefore, my rating tends toward ‘reject’.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The rebuttal addressed some of my concerns. A concern remains about the effectiveness of the method compared with other types of TTA. For example, a TTA scheme can change the colors of images to generate augmented images; if such color augmentation were applied, what would the results be (is the proposed mix-up method better)? However, after the rebuttal, the reviewer feels that the idea of using TTA for domain adaptation is useful for the MICCAI community and has therefore upgraded the rating to ‘weak accept’.



Review #3

  • Please describe the contribution of the paper

    The authors propose a test-time stain augmentation method for cell detection under stain-varying conditions between the source and target domains. The method uses a conventional decomposition in the OD domain to decompose RGB images into a stain color matrix and a stain density map. Multiple augmented test images are generated by mixing their stain color with that of the source domain through different weighting factors. The method is validated for cell detection on a publicly available dataset, on which it outperforms existing similar approaches.
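
    As a reference for the OD-domain decomposition mentioned above, here is a minimal sketch assuming the Beer-Lambert relation between RGB intensities and optical density. The paper reportedly estimates the stain color matrix with Vahadane's sparse stain separation (see Meta-review #3), which is not reproduced here; the sketch simply recovers the density map by least squares for a given stain color matrix.

```python
# Hypothetical OD-domain decomposition sketch (standard Beer-Lambert, not the paper's code).
import numpy as np

def rgb_to_od(img, i0=255.0):
    """Convert an RGB image (h x w x 3, uint8) to optical density values (N x 3)."""
    rgb = img.reshape(-1, 3).astype(np.float64)
    rgb = np.clip(rgb, 1.0, i0)             # avoid log(0)
    return -np.log(rgb / i0)

def stain_density(od, W):
    """Recover the 2 x N stain density map from OD values (N x 3) and a
    3 x 2 stain color matrix W (columns = stain vectors) by least squares."""
    H, *_ = np.linalg.lstsq(W, od.T, rcond=None)
    return np.clip(H, 0.0, None)            # stain densities are non-negative
```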

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Simplicity: The approach is simple and based on the OD space decomposition and generating multiple test augmented images with different weighting factors.

    Flexibility: The method is a generic test-time augmentation approach and can be combined with any trained method during testing; hence, it can be used with different training strategies. The method is also generic in that the augmentation can be tilted toward the source or the target domain through the selection of the weighting factors.

    Performance: When used during inference, the approach improves the performance of different training methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited Novelty: The work is an extension of [4], a modified version of which is used during inference instead of training. As modifications, the authors take a representative stain color matrix from the source domain instead of using a random matrix. However, an approach for combining the predictions on the multiple augmented test images for cell detection is also proposed.

    Limited Application: The method is specifically designed for cell detection and hence is limited from this perspective.

    Limited Analysis: The rationale behind some steps is not clear. The applicability of Equations (4) and (5) is not clear. The performance of directly using a random image from the source domain in (3), as in [4], is not reported. The reason for using the Mahalanobis distance is also not justified; it could be replaced with other distance metrics.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility responses are followed in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I would recommend the following for future work:

    1. The method should be explored for other applications apart from cell detection.

    2. The significance of (4) and (5) should be demonstrated by comparing them with the direct use of a random source image.

    3. If (4) and (5) are necessary, different types of distance metrics should be explored to select the optimal one.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strong and weak factors for this decision are:

    1. Simplicity, flexibility, and performance, as discussed in the “strengths”.
    2. The limited application and the limited analysis; certain aspects are unclear due to the limited analysis (see “weaknesses”).
  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The reviewer’s concerns remain unanswered; hence, the original decision is retained.



Review #4

  • Please describe the contribution of the paper

    This paper is concerned with the development of a stain normalization method aimed at improving the performance of cell counting in histopathology images. The emphasis is on achieving robustness to domain variation (the training set and the target set do not come from the same staining protocol). In particular, the authors adopt a test-time stain augmentation approach to the domain generalization problem. The concept is based on generating mixtures of the stain colors of source-domain and target-domain images. The outputs of a given detection model on the mixed images are fused with a strategy that is less likely to be affected by improperly mixed stains. The detection model does not need to be retrained, because the method is applied at test time only. The concept is validated on a public dataset related to mitosis detection. In principle, the proposed method improved the counting performance in terms of the A_50 metric (as well as the F1 score).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the proposed method is that the stain normalization model does not need to be trained, i.e., the generation of source-target domain mixtures is executed at test time only. Furthermore, source-domain images can be pre-selected, including the case of only one source image, which can also resolve privacy issues if they exist. Another important strength is that the proposed test-time concept is agnostic to the detection model as long as it produces bounding boxes and confidence scores.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weakness of the proposed method is that the whole concept relies on the conjecture that the model’s predictions for improperly mixed source-target stains are likely to be suppressed, which in turn assumes that improper mixtures will have poor confidence scores. Moreover, the fusion hyperparameter (the number of detections with the highest confidence scores that are fused, set to 3 in the problem considered in the paper) is selected on an ad hoc basis. It is unclear whether the same setting would work for a different scenario (dataset). The authors should at least have performed a sensitivity analysis with respect to this parameter on the same dataset.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The source code implementing the presented concept is not provided. The implementation details given in the paper should, in principle, suffice for an experienced programmer to implement the concept.

    The hyperparameter of the proposed concept is the number of detected cells with sufficiently high confidence scores that are fused together. It is set to 3 in the only experiment conducted in the paper, but no justification based on a sensitivity analysis is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The central part of this concept is the fusion strategy, which operates on the conjecture that the model’s predictions for improperly mixed source-target stains are likely to be suppressed. Toward that, a certain number of detected cells per group are fused based on their confidence scores, under the assumption that improper mixtures will yield poor confidence scores.

    The hyperparameter of the proposed concept is the number of detected cells with sufficiently high confidence scores that are fused together. It is set to 3 in the only experiment conducted in the paper, but no justification based on a sensitivity analysis is provided.
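
    To make the conjecture concrete, the following is a purely illustrative sketch of a confidence-based fusion with a top-K rule; the grouping by center distance, the thresholds, and the scoring below are assumptions, not the paper's exact strategy.

```python
# Hypothetical fusion of detections from the stain-augmented test images (illustrative only).
import numpy as np

def fuse_detections(per_alpha_detections, k=3, dist_thresh=15.0, score_thresh=0.5):
    """Pool detections (cx, cy, conf) from all augmented images, group them by center
    distance, score each group by the mean of its top-k confidences, and keep groups
    whose fused score passes the threshold."""
    pooled = [d for dets in per_alpha_detections for d in dets]
    pooled.sort(key=lambda d: -d[2])                  # highest confidence first
    fused, used = [], [False] * len(pooled)
    for i, (cx, cy, conf) in enumerate(pooled):
        if used[i]:
            continue
        used[i], group = True, [conf]
        for j in range(i + 1, len(pooled)):
            ox, oy, oconf = pooled[j]
            if not used[j] and np.hypot(cx - ox, cy - oy) < dist_thresh:
                used[j] = True
                group.append(oconf)
        fused_score = float(np.mean(sorted(group, reverse=True)[:k]))
        if fused_score >= score_thresh:               # improper mixtures tend to score low
            fused.append((cx, cy, fused_score))
    return fused
```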

    Many sentences are extremely long, which occasionally makes the paper hard to follow.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is concerned with the development of a stain normalization method aimed at improving the performance of cell counting in histopathology images, where the training set and the target set do not come from the same staining protocol. The authors adopt a test-time stain augmentation approach to the domain generalization problem. Thus, the proposed stain normalization method does not need to be trained, i.e., the generation of source-target domain mixtures is executed at test time only. Furthermore, source-domain images can be pre-selected, including the case of only one source image, which can also resolve privacy issues if they exist. Moreover, another important strength is that the proposed test-time concept is agnostic to the detection model as long as it produces bounding boxes and confidence scores. These arguments outweigh the potential weakness of the proposed method related to the a priori (arbitrarily?) defined hyperparameter, namely the number of detected cells with high confidence scores.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    In my review I raised the concern that the whole concept relies on the conjecture that the model’s predictions for improperly mixed source-target stains are likely to be suppressed. Toward that, a certain number of detected cells per group are fused based on their confidence scores, under the assumption that improper mixtures will have poor confidence scores. But that number (set to 3 in the problem considered in the paper) is again selected on an ad hoc basis. It is unclear whether the same setting would work for a different scenario (dataset).

    The authors’ rebuttal was: “We would like to clarify that although there is only one dataset in the experiment, the test data involves two different target domains (B & C). For both target domains, the selected K has achieved excellent performance. Thus, we expect that the selected K can be generalized to other scenarios.”

    Thus, the authors still remain in the “speculation domain”. That is why I stay with my original decision, “accept - good paper with moderate weakness”.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This manuscript presents a test-time stain augmentation method for generalizable cell detection in histopathology images. The reviewers appreciated that the method is simple, requires no model re-training, and produces improved performance compared with the baseline model. However, the reviewers also raised several significant concerns. R2 pointed out that the method does not exhibit strong technical contributions, because the stain augmentation and the prediction fusion are mainly based on [4] and [3], respectively. In addition, it is not clear why the mix-up augmentation is good for test images without training, and the paper does not sufficiently verify the effectiveness of combining stain mix-up with the test-time ensemble. Finally, a discussion of existing test-time ensemble methods is missing from the paper.

    R3 commented that the technical novelty is limited and that the significance of Equations (4) and (5) is not clear.

    R4 commented that the assumption of the proposed method is not well justified and that the lack of a sensitivity analysis of the key hyperparameter (i.e., the number of detected cells with the highest confidence scores) further weakens the paper.

    Please consider addressing the reviewers’ concerns in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

Q1: Reviewer 2 (R2) and Reviewer 3 (R3) suggest that the proposed work is an extension of stain mix-up and test-time augmentation (TTA) and thus the novelty is limited; R2 also suggests a discussion of existing TTA methods. Response: In existing domain adaptation/generalization methods for cell detection, the domain transformation is performed during training, so that the source and target domains are aligned. However, domain alignment is challenging, and using it during training alone may be insufficient for unseen test images. Additional alignment may still be needed for a new test image, and this cannot be achieved during training when the test image is unavailable. To address this issue, we propose to further align the test image to the optimal domain of the trained detector at test time. Since this optimal domain is unknown, we assume that it lies between the source and target domains and consider a few possibilities, one or more of which are close to the optimal domain. Therefore, compared with stain mix-up, which only considers stain transformation for model training, the novelty of our work is to further align the unseen test images to the optimal domain, which is not trivial because the optimal domain is unknown.

Compared with existing TTA methods that perturb the test image and ensemble the results, the motivation of our method is fundamentally different. We generate multiple samples with different stain augmentations because the optimal stain transformation is unknown, and we need to identify the transformations that are potentially useful. This is not an issue for existing TTA methods that are based on, e.g., spatial transformations and random effects. Existing TTA methods are not suitable for our task because their operations remain in the native color domain. We will add this discussion of the difference between our method and existing TTA methods.

Note that we achieve our design mostly with existing tools, which allows convenient implementation of our method, but the overall motivation and idea are substantially different from, and improve upon, existing methods.

Q2: R2 suggests that test-time stain augmentation is unnecessary and the improvement is due to ensembling, because alpha=0 is the best in most cases. Response: We would like to clarify that alpha=0 corresponds to a full stain transformation of the test image to the source domain, whereas alpha=1 corresponds to the original test image (see Eq. (3)). Based on the results in Table 2, when no stain augmentation is used (alpha=1), the detection performance is rather poor, and simply ensembling the results without stain augmentation would also lead to poor detection performance. Thus, the joint use of test-time stain augmentation and fusion is necessary.
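
A reading of the mixing step consistent with this clarification (the exact form of Eq. (3) is only given in the paper, so the expression below is an assumption) is a convex combination of the target and source stain color matrices:

```latex
% Assumed form of the stain color mixing: \alpha = 1 keeps the test image's own stain,
% while \alpha = 0 transfers it fully to the source domain.
W_{\mathrm{mix}} = \alpha \, W_{\mathrm{target}} + (1 - \alpha) \, W_{\mathrm{source}}, \qquad \alpha \in [0, 1]
```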

Q3: Reviewer 4 (R4) suggests that the assumption of the proposed method needs further justification, as with only one experiment it is unclear if the selected hyperparameter K will work for other scenarios. Response: We would like to clarify that although there is only one dataset in the experiment, the test data involves two different target domains (B & C). For both target domains, the selected K has achieved excellent performance. Thus, we expect that the selected K can be generalized to other scenarios.

Q4: R3 is interested in the use of a random image for test-time stain augmentation instead of Eqs. (4) and (5). Response: The use of a random image at test time for mixing would lead to highly variable performance, depending on which image is selected. Therefore, we choose a representative image for test-time stain augmentation.
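
One hedged reading of this choice, assuming Eqs. (4) and (5) select the source image whose stain color matrix is closest to the mean of the source set under the Mahalanobis distance (the exact equations are in the paper, so this is an assumption):

```python
# Hypothetical selection of a representative source image via Mahalanobis distance.
import numpy as np

def select_representative(stain_matrices):
    """Return the index of the source image whose flattened stain color matrix is
    closest (in Mahalanobis distance) to the mean over the source set."""
    X = np.stack([W.reshape(-1) for W in stain_matrices])        # one row per source image
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])    # regularize for invertibility
    cov_inv = np.linalg.inv(cov)
    dists = [float((x - mu) @ cov_inv @ (x - mu)) for x in X]    # squared Mahalanobis distances
    return int(np.argmin(dists))
```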

Q5: R3 expects more applications of our method besides cell detection. Response: The augmentation part of our method is agnostic to the application, and only the fusion part needs slight modification for other tasks. Thus, we believe that our method can be easily extended to other tasks.

Code will be shared after publication. Other minor issues will also be addressed.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces a test-time stain augmentation method to improve the generalization of a trained cell detector. The key strength is that the method does not need to re-train the detector for new testing images. Although the method is not compared with other test-time data augmentation methods and is evaluated using one dataset, the reviewers gave positive comments on the idea of the test-time augmentation for domain adaptation in medical image computing and the improved performance compared with the baseline model. After the rebuttal, all the reviewers recommend acceptance of the paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    All of the reviewers mention that this paper has very limited novelty. In the rebuttal the authors further clarify that their approach does differ from others, but it is still an incremental improvement. The use of stain separation for augmentation is not new, and the use of test-time augmentation is also not new; for example, in the MIDOG challenge I believe that most of the top teams used some form of test-time augmentation. I am not sure whether stain mix-up has been used in this context before, but it is already a well-known method. There is definitely a place in the literature for a paper presenting a thorough evaluation of different test-time augmentation strategies and a proper comparison with train-time augmentation; however, the reviewers’ comments suggest that this paper lacks some of the needed ablation studies. This paper is neither sufficiently novel nor sufficiently “impactful” to be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    12



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a test-time colour augmentation method based on the original Vahadane or stain mix-up stain normalization methods. The major concern of the reviewers and meta-reviewers is the amount of novelty in the paper. Since the test-time augmentation method itself is based on existing methods, there is not much novelty. Although the authors argue in the rebuttal that their method is methodologically different from simpler, more frequently used colour-jittering test-time augmentation, this does not necessarily mean that the proposed method is superior in performance, since there is no such comparison in the paper. Additionally, the method is based on Vahadane’s sparse stain separation and could be time-consuming, which can affect the practical usage of such a test-time augmentation.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #4

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    AC recommendations on this paper were split, with a majority vote of “rejection”, while the reviewers expressed consensus in supporting acceptance after the rebuttal. The PCs thus assessed the paper reviews, meta-reviews, the rebuttal, and the submission. Although there were lingering issues that could be addressed, the paper has merits in simplicity, flexibility, and performance, which were appreciated by the reviewers. While all reviewers suggested areas for future improvement, all of them maintained support for the paper after reviewing the rebuttal, and one reviewer raised the score to support acceptance. The PCs agree with the convincing arguments of the reviewers and AC, and thus the final decision on the paper is accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR


