
Authors

Marin Scalbert, Maria Vakalopoulou, Florent Couzinié-Devy

Abstract

Histopathology whole slide images (WSIs) can reveal significant inter-hospital variability such as illumination, color or optical artifacts. These variations, caused by the use of different protocols across medical centers (staining, scanner), can strongly harm algorithm generalization on unseen protocols. This motivates the development of new methods to limit such loss of generalization. In this paper, to enhance robustness on unseen target protocols, we propose a new test-time data augmentation based on multi-domain image-to-image translation. It projects images from an unseen protocol into each source domain before classifying them and ensembling the predictions. This test-time augmentation method results in a significant performance boost for domain generalization. To demonstrate its effectiveness, our method has been evaluated on two different histopathology tasks where it outperforms conventional domain generalization, standard H&E-specific color augmentation/normalization and standard test-time augmentation techniques. Our code is publicly available at https://gitlab.com/vitadx/articles/test-time-i2i-translation-ensembling.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_12

SharedIt: https://rdcu.be/cVRrb

Link to the code repository

https://gitlab.com/vitadx/articles/test-time-i2i-translation-ensembling

Link to the dataset(s)

https://wilds.stanford.edu/datasets/

https://zenodo.org/record/53169#.Yr3AIdIzYUE

https://zenodo.org/record/1214456#.Yr3AUdIzYUE

https://warwick.ac.uk/fac/cross_fac/tia/data/crc-tp


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a new test-time data augmentation based on multi-domain image-to-image translation, implemented with a StarGanV2. The proposed network achieves better performance on 3 public datasets compared with other SOTA methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Very intuitive idea and the proposed algorithm gets a good boost
    2. Experiments were conducted using multiple seeds on large enough datasets while comparing many mainstream OOD generalization methods
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although there is a good improvement, the idea's novelty is still somewhat lacking. A similar idea can be found in https://proceedings.neurips.cc/paper/2021/file/a8f12d9486cbcc2fe0cfc5352011ad35-Paper.pdf. They achieved 94.8% on camelyon17.
    2. When comparing with other TTA methods, can the authors provide a forward count/test time comparison? The performance increases, but the time required by the authors' method may increase many times over, which is fatal for WSI images.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Very clear description, can be reproduced

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The authors should compare with https://proceedings.neurips.cc/paper/2021/file/a8f12d9486cbcc2fe0cfc5352011ad35-Paper.pdf, and discuss the differences.
    2. The authors should provide comparison metrics such as forward-pass counts or test time when comparing with other top methods.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the paper

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The focus of this study is to improve model generalization. To this end, test-time image augmentation is performed based on multi-domain image translation. During test time, the proposed pipeline transforms an input image into a number of invariants with multiple domain-specific styles. All these invariants are then passed through a classifier to obtain their respective class predictions. The final prediction for the input image is decided based on the ensembling strategy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The application of generative models at test time is a novel idea for improving the model’s performance.
    • The proposed approach is evaluated for two different tasks and compared with a number of other approaches.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The manuscript could be improved further and needs clarity and details on various points (see point 8).
    • The proposed approach may not scale well with an increase in the number of domains. Getting predictions for a large number of invariants in order to get a final prediction for an input image would be time consuming.
    • It is not clear if separate discriminator and classifier are trained for each domain style. If so, it could be a bottleneck when there are many domains.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No links or references for the source code and dataset, however the authors have stated in the reproducibility form that they will release it upon acceptance of the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • I would suggest adding details of how mapping between different domains was learned with a simultaneous classification task. The inference pipeline is clear, however training is not.
    • Page 2 “UDA methods needs unlabeled data from the target domain”. Do you mean availability of domain labels?
    • Please add reference in the last sentence of the last para of Introduction section.
    • Section 2.1: a random latent latent code -> redundant latent
    • Add details on training the model. Due to page limits, this could be added to the supplementary doc.
    • Give reference to Supplementary Figure 2 and Table 1 in the main manuscript.
    • Add details of baseline method. Also, it is not clear what “Base + XYZ method” (Table 1) really means.
    • What are those values in Table 1?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • There are various points that need clarity.
    • The proposed approach is not scalable.
  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    The authors propose a test-time augmentation (TTA) strategy based on StarGAN-V2 to specifically improve the domain generalization ability for histology images. This method is intended to overcome the stain variations across different data centers, which is a very common challenge in histology analysis. Three kinds of ensembling methods have been proposed to enhance the TTA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The domain gap is a very practical challenge for almost any multi-center histology study, due to stain variations. This paper provides a new strategy (i.e., test-time augmentation) in addition to the previous stain normalization/augmentation for addressing this issue, where the previous methods are applied at training time or prior to training.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Efficiency is also important for medical images, especially at the inference stage, yet execution time and computational cost are not compared. 2) There are no comparisons with any stain normalization approaches, which are the most commonly used in histology to overcome stain variations. 3) To fairly evaluate the performance of StarGAN-V2 (which is a stain augmentation approach), comparisons with (i) TTA + Stain Normalization and (ii) TTA + Stain Augmentation [ref. 21], which are much more efficient histology-specific TTA approaches, are missing. These approaches require no extra computational cost to train a StarGAN, yet show very good empirical results in my experiments (they are very effective and efficient techniques for competitions).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Probably fair. But it would be better if the authors could release their codes.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Execution time comparison, and performance comparison with (i) TTA + Stain Normalization; (ii) TTA + Stain Augmentation [ref. 21] are suggested (please see weakness for more details as I am not going to repeat.)

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Lack of proper experimental comparisons with simple but efficient competitive methods (see weakness #3)

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received mixed comments. Reviewers acknowledge the new ideas for handling domain shift in histopathology images and the good performance.

    However, reviewers also raise concerns about the time complexity of the proposed framework. Thus, the authors are invited for a rebuttal to address the reviewers' concerns. Specifically, the authors should pay more attention to the following points.

    1. Discuss and clarify the time complexity (training/testing time) and the scalability of the proposed framework.
    2. Address the lack of a comparison with stain normalization approaches.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We would like to thank the reviewers and the AC for their constructive comments that helped to improve our article. The main concerns raised by the reviewers and the AC are addressed in the following paragraphs. Suggested modifications will be included in the article or the supplementary material.

  1. Clarification of training and inference pipelines (R3): (1) A StarGanV2 is trained on the source domains (shared discriminator backbone and domain specific heads enable training on a large number of domains) (2) A classifier is trained on the source domains using the trained StarGanV2 for train-time data augmentation. (3) At test-time, each test image is projected onto each source domain using the trained StarGanV2, the different images are classified and predictions are aggregated.
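As a rough illustration of step (3) only, the sketch below projects a test image onto each source domain, classifies every projection and aggregates the predictions. It is not the released implementation: `tta_ensemble`, `translate` and `classify` are hypothetical names, the toy stand-ins replace the trained StarGanV2 generator and classifier, and averaging class probabilities is an assumed aggregation rule.

```python
from typing import Callable, List, Sequence, TypeVar

Img = TypeVar("Img")  # image type, left generic for illustration
Dom = TypeVar("Dom")  # source-domain identifier or style code


def tta_ensemble(
    image: Img,
    source_domains: Sequence[Dom],
    translate: Callable[[Img, Dom], Img],
    classify: Callable[[Img], List[float]],
) -> List[float]:
    """Project `image` onto every source domain, classify each projection,
    and return the mean class probabilities (assumed aggregation rule)."""
    if not source_domains:
        raise ValueError("at least one source domain is required")
    per_domain = [classify(translate(image, d)) for d in source_domains]
    n_classes = len(per_domain[0])
    return [
        sum(probs[c] for probs in per_domain) / len(per_domain)
        for c in range(n_classes)
    ]


# Toy stand-ins (hypothetical): a real pipeline would call the trained
# image-to-image generator and the trained classifier here.
def toy_translate(image: List[float], domain_shift: float) -> List[float]:
    return [pixel + domain_shift for pixel in image]


def toy_classify(image: List[float]) -> List[float]:
    p_tumor = min(max(sum(image) / len(image), 0.0), 1.0)
    return [1.0 - p_tumor, p_tumor]


probs = tta_ensemble([0.2, 0.4], [0.0, 0.1, 0.2], toy_translate, toy_classify)
print(probs)
```

The cost of this loop grows linearly with the number of source domains, since each test image needs one translation and one classification per domain.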

  2. Comparison with stain normalization methods (R4 and AC): we provide the accuracy on the Camelyon17 WILDS dataset with a standard stain normalization method (Macenko) and with an H&E color jitter method. These methods are also combined with test-time augmentation (TTA) methods to evaluate the gains induced by the TTA.

Method                     Val    Test   Avg
Macenko                    81.6   92.5   87
Macenko + geometric TTA    81.9   93.1   87.5
H&E color jitter           91.8   84.8   88.3
H&E color jitter + TTA     90.3   84.3   87.3
Proposed method            92.8   94.0   93.4

Stain normalization and H&E color jitter perform well on one domain (val or test) but not the other. When combined with TTA methods, the accuracy improves only marginally or even decreases (H&E color jitter + TTA). Our method achieves the best performance on both the val and test domains, and is therefore robust across them. This makes sense: stain normalization + geometric TTA is of limited use since every generated image is projected to the same color. While both our TTA method and H&E color jitter TTA rely on several stain augmentations, ours is not random: it projects the test image towards the source domains, leading to better generalization.

  3. Time complexity and scalability of the approach (R2, R3, R4 and AC): in the table below, we provide inference times per batch (in milliseconds) for different TTA methods on images from the Camelyon WILDS dataset with a batch size of 32 and 10 domains.
TTA method               Time per batch (ms)
No TTA                   15.7
Geometric TTA            69.7
H&E Color Jitter TTA     144
StarGanV2 TTA            667
Lighter StarGanV2 TTA    105.6

Geometric and H&E Color Jitter TTA are indeed faster: our method is approximately 10 and 4 times slower, respectively. However, our TTA leads to large performance gains. Indeed, from Tab 1 of the paper, when comparing our method with and without TTA, we can see that our TTA method leads to +3.2% on Camelyon17 val, +17.6% on Camelyon17 test, +6.6% on Kather19 and +16.7% on CRCTP.
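For reference, the slowdown factors can be recomputed directly from the reported per-batch times. The short sketch below uses only the numbers from the timing table; the helper name `slowdown_vs` is an illustrative choice.

```python
# Inference times per batch (ms), copied from the rebuttal's timing table.
times_ms = {
    "No TTA": 15.7,
    "Geometric TTA": 69.7,
    "H&E Color Jitter TTA": 144.0,
    "StarGanV2 TTA": 667.0,
    "Lighter StarGanV2 TTA": 105.6,
}


def slowdown_vs(reference: str, times: dict) -> dict:
    """Return how many times slower `reference` is than every other method."""
    ref_time = times[reference]
    return {
        name: ref_time / t for name, t in times.items() if name != reference
    }


slowdown = slowdown_vs("StarGanV2 TTA", times_ms)
for name, ratio in slowdown.items():
    print(f"StarGanV2 TTA takes {ratio:.1f}x the time of {name}")
```

With these figures the ratios come out at roughly 9.6 (geometric TTA), 4.6 (H&E color jitter TTA) and 6.3 (lighter StarGanV2 TTA).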

Our method can indeed suffer from scalability issues. However, they can be mitigated:

  • As can be seen on the 1st plot in Fig 3, the TTA can be done on a subset of domains without sacrificing much performance.
  • We used a heavy generator in the StarGanV2 which could be replaced by a lighter and more modern one without degrading image translation quality and without changing the proposed method (lighter StarGanV2 above, 6 times faster).
  • Finally, grouping the WSIs based on their protocol (medical centers, scanners) could considerably reduce the number of domains.
  4. Missing reference (R2). As pointed out by R2, 'MBDG' uses a similar approach. Indeed, they also exploit a multi-domain image-to-image translation model (MUNIT) trained on the source domains to perform train-time data augmentation. However, in our method, the translation model is also exploited for test-time augmentation. Our method is therefore complementary to theirs, and to any other DG or train-time data augmentation method presented in Tab 1 of the paper.




Post-rebuttal Meta-Reviews

Meta-review #1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has addressed the concerns about comparison with other baseline methods. Although time complexity is a potential drawback of the proposed method, as discussed in the rebuttal it can be alleviated by techniques such as using a lightweight network architecture.

    The AC thinks this paper provides a good basis for investigating out-of-distribution generalization methods in histopathology.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors provide additional information in the rebuttal that addresses some of the reviewers' concerns, but the worry that the GAN approach increases processing time significantly is a valid one, particularly since the test-time performance of the Macenko approaches seems to be very close to that of the proposed method.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    14



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a StarGanV2-based test-time augmentation method to improve model generalization. The method can produce good results in downstream classification tasks. The rebuttal has addressed the reviewers’ major concerns, such as the method’s time complexity and scalability as well as a comparison with stain normalization methods.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8


