
Authors

Yunlong Zhang, Yuxuan Sun, Honglin Li, Sunyi Zheng, Chenglu Zhu, Lin Yang

Abstract

	When designing a diagnostic model for a clinical application, it is crucial to guarantee the robustness of the model with respect to a wide range of image corruptions. Herein, an easy-to-use benchmark is established to evaluate how deep neural networks perform on corrupted pathology images. Specifically, corrupted images are generated by injecting nine types of common corruptions into validation images. In addition, two classification metrics and one ranking metric are designed to evaluate prediction and confidence performance under corruption. Evaluated on the two resulting benchmark datasets, we find that (1) a variety of deep neural network models suffer a significant accuracy decrease (double the error on clean images) and unreliable confidence estimation on corrupted images; and (2) the correlation between validation and test errors is low, while replacing the validation set with our benchmark increases the correlation. Our code is available at https://github.com/superjamessyx/robustness_benchmark.
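The abstract does not spell out how the corruptions are injected. Purely as an illustration, a minimal sketch of how such a corrupted validation set could be generated, assuming PIL images and two example corruption types at five severity levels (the severity parameters below are hypothetical, not taken from the paper), might look as follows:

```python
import numpy as np
from PIL import Image, ImageFilter

# Hypothetical severity parameters; the paper's actual values are not given here.
BLUR_RADII = [1, 2, 3, 4, 6]                    # Gaussian blur radius, severity 1-5
NOISE_SIGMAS = [0.04, 0.08, 0.12, 0.18, 0.26]   # Gaussian noise std, severity 1-5

def gaussian_blur(img: Image.Image, severity: int) -> Image.Image:
    """Blur the patch with a severity-dependent radius."""
    return img.filter(ImageFilter.GaussianBlur(radius=BLUR_RADII[severity - 1]))

def gaussian_noise(img: Image.Image, severity: int) -> Image.Image:
    """Add zero-mean Gaussian noise with a severity-dependent standard deviation."""
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = x + np.random.normal(0.0, NOISE_SIGMAS[severity - 1], x.shape)
    return Image.fromarray((np.clip(x, 0.0, 1.0) * 255).astype(np.uint8))

CORRUPTIONS = {"gaussian_blur": gaussian_blur, "gaussian_noise": gaussian_noise}

def build_corrupted_set(images, corruptions=CORRUPTIONS, severities=range(1, 6)):
    """Yield (corruption_name, severity, corrupted_image) for every combination."""
    for name, fn in corruptions.items():
        for s in severities:
            for img in images:
                yield name, s, fn(img, s)
```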

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16434-7_24

SharedIt: https://rdcu.be/cVRrG

Link to the code repository

https://github.com/superjamessyx/robustness_benchmark

Link to the dataset(s)

https://patchcamelyon.grand-challenge.org/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes to synthetically generate corruptions of pathology images. Nine types of corruption were considered, at five severity levels each. Ten CNNs are trained and their performance is tested on the regular validation set, the corrupted validation set, and a held-out test set. The authors found that 1) the corrupted data leads to a higher error rate; 2) model confidence increases with the level of corruption severity; 3) different corruptions affected the models differently; 4) early stopping helps with robustness; 5) the error on the corrupted validation set is more predictive of the generalizability of the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The experimentation is extensive, and the conclusions from the experiments are mostly useful. It’s interesting to see that the error on the corrupted validation set is a better predictor of generalization. This suggests that the proposed corruptions bring the images closer to the real test set.
    • The paper is relatively easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Crucial details of how the corruptions are implemented are missing and not available even in the supplement. No promise of code or dataset release is made. This has a strong negative impact on reproducibility. It also makes it hard to judge the true quality of the corruptions. While the experimental results seem to suggest that the corruptions work as intended, there could be some unexpected deviation from reality (e.g., not enough marking included, or deviation from the non-random nature of pathologist markings). Releasing code and providing more details about the method would improve the strength of this paper considerably.

    • I find each experiment not thorough enough. While the high-level conclusions drawn are interesting, many questions are left unanswered, leaving those conclusions rather weak. For example, I find correlation to be a weak metric to show that a model that is more robust to corruption is also more generalizable. It could simply be that the corruptions create more out-of-distribution samples, so the errors on the two sets happen to be correlated. I think more investigation is needed. One thing that comes to mind is to train with corrupted images and see whether generalization actually improves. In my opinion, this would be stronger evidence that the proposed corruptions work as intended.

    Another example is the reverse relationship between severity and model confidence. Isn’t this an expected result? As the corruption pushes the image outside the distribution, isn’t the model likely to make a higher-confidence negative prediction in this case?

    • There is some discussion that I think would be better left to the supplementary material to leave room for more important points (such as the one discussed above). In particular, the exact formulas for the metrics could be left largely to the supplement. The basic idea is relatively straightforward and, in my opinion, does not need to be explained in depth in the main paper.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Poor. No detail on implementation anywhere, and no code to be published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    See my weakness section.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I think this is still an interesting paper despite all the weaknesses listed.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    This paper presents two new benchmarks to evaluate how deep neural networks perform on corrupted pathology images. Specifically, corrupted images are generated by injecting nine types of common corruptions into validation images. Two classification metrics and one ranking metric are designed to evaluate prediction and confidence performance under corruption. Furthermore, this paper demonstrates the poor robustness of modern CNNs to input corruptions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper explores an important topic, the robustness of CNNs, which is significant for real-world deployment;
    2. The design of the corruption types is close to reality;
    3. The experimental finding that modern CNNs show poor robustness is valuable. Another interesting observation is that DNNs have been constantly improved over the past decade, yet their classification performance on corrupted pathology images has changed only slightly.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors claim that the proposed corruptions are easy to use in practical settings since they are implemented by being plugged into the dataloader class. Although Sec. 3.1 briefly describes the implementation, it is still unclear how it is implemented and why it is easy to use (see the sketch after this list for one possible reading).
    2. Minor mistakes (e.g., the label of the y-axis in Fig. 3) should be corrected.
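For context, a minimal sketch of what “plugged into the dataloader class” could mean in practice, assuming a PyTorch-style Dataset wrapper; the class and parameter names here are hypothetical and not the authors’ actual implementation:

```python
from torch.utils.data import Dataset, DataLoader

class CorruptedWrapper(Dataset):
    """Hypothetical wrapper: applies one fixed corruption to every sample of a base dataset."""
    def __init__(self, base_dataset, corruption_fn, severity: int):
        self.base = base_dataset
        self.corruption_fn = corruption_fn
        self.severity = severity

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]              # base dataset assumed to return (PIL image, label)
        img = self.corruption_fn(img, self.severity)
        return img, label

# Usage sketch: wrap an existing validation set and evaluate with an otherwise unchanged loader.
# val_loader = DataLoader(CorruptedWrapper(val_set, gaussian_blur, severity=3), batch_size=64)
```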
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper seems simple to reproduce, but the details are still a bit unclear.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. I suggest that the authors publish the code.
    2. I suggest the authors further investigate whether existing robustness studies work on the proposed benchmarks.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This study addresses a significant problem that arises when models are deployed. The extensive experiments are convincing.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I keep my overall rating with confidence after checking the reviews, meta-review, and the authors’ responses. Firstly, the promise of publishing the code guarantees reproducibility. Secondly, the authors emphasize the large scale of the datasets in response to the first question of Reviewer #3, which is essential for identifying the problem.



Review #3

  • Please describe the contribution of the paper

    The paper builds an ensemble of corruption methods to be used as a standard suite for evaluating model robustness in histopathology.
    It evaluates the methods on a pretty extensive list of standard architectures, including the more modern vision transformers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of the paper is intriguing. It is well known that model confidence is not a good estimate of probability, especially under out-of-distribution perturbations. I like the idea of running a standard suite of tests on models to evaluate which of them are more robust and/or show reliable confidence estimates.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I don’t understand why a local dataset could serve as a benchmark, given that it was not published.

    As known from domain-shift investigations, model robustness to a covariate shift in the input is highly dependent on the training run. Single training runs, as done in the reported experiments, are thus very likely not conclusive for such investigations.

    The authors write that AlexNet scores best in terms of their rCE metric. However, looking at Table 1, I would only infer that the CE metric is similar across all model architectures, so given the formula for the rCE metric it is natural that the model performing worst on clean data achieves the best value here. This, however, does not imply (as the authors suggest) that this metric is a good proxy for model robustness.
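For reference, the CE and rCE metrics discussed here are not defined in this excerpt. The sketch below follows the ImageNet-C-style definitions (errors averaged over corruptions and severities and normalized by a baseline model such as AlexNet), which appears consistent with the reviewer’s reading; whether the paper uses exactly this form is an assumption.

```python
import numpy as np

def corruption_error(errors, baseline_errors):
    """CE: mean error over all corruption/severity pairs, normalized by a baseline
    model's (e.g. AlexNet's) mean error on the same pairs."""
    return np.mean(errors) / np.mean(baseline_errors)

def relative_corruption_error(errors, clean_error, baseline_errors, baseline_clean_error):
    """rCE: degradation relative to clean performance, again baseline-normalized.
    A model with a high clean error trivially obtains a small numerator, which is
    the effect the reviewer points out."""
    return np.mean(np.array(errors) - clean_error) / np.mean(
        np.array(baseline_errors) - baseline_clean_error)
```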

    I also do not agree with the notion “although CNNs are constantly improved in the past decade, their performance on corrupted images changes little while causing the incredibly worse robustness”: some of these corruptions deteriorate the image severely, and a reduced metric might be highly correlated with reduced performance by human experts as well. Thus, this is not a sign of weak robustness but might simply reflect information being destroyed in the image. Robustness is only meaningful within the limits of the information still contained in the image. If the corruption scheme destroys diagnostic information in the image, reduced performance is to be expected and cannot be attributed to a lack of model robustness.

    The authors state that one of their findings is that overfitting harms the robustness of the models. But that’s actually the very definition of overfitting.

    The paper structure is also a bit unclear. Parts of the experimental results are already reported in the introduction. Further, the authors did not discuss the limitations of their approach in any way.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code is given (although the authors state otherwise in the questionnaire). The authors refer to one of the datasets as a possible benchmark, yet it is not available and no link is provided. It will thus be hard to reproduce the results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The biggest weakness, in my opinion, is that the authors compare single-shot trainings of various architectures, which are hardly comparable. If the authors want to evaluate the robustness of an architecture, they should, in my opinion, use several training runs of the same architecture (and please also report the distribution).

    I would recommend including a subjective evaluation. If the relevant information in the input image is destroyed, a drop in model performance cannot be called a lack of robustness; that attribution is only possible if a human expert can still retrieve the information. I think the authors should try to limit the perturbations to effects that only affect model robustness and not recognition in general (e.g., by experienced experts).

    I would also question the usefulness of the rCE metric in that sense. If all models have the same (mediocre) results after corruption of the input images, it is not really informative to set that into relation to the original performance.

    The metric I liked most was the CEC metric, as it tackles model confidence. It would be interesting to compare it against a standard metric such as rank correlation.
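As a concrete version of this suggestion, a rank-correlation check between corruption severity and mean predicted confidence could be done with SciPy; this is an illustrative sketch with made-up numbers, not part of the paper.

```python
from scipy.stats import spearmanr

# Hypothetical numbers: mean confidence of the predicted class at severities 1..5.
severities = [1, 2, 3, 4, 5]
mean_confidence = [0.93, 0.91, 0.90, 0.92, 0.94]   # ideally this should decrease

rho, p_value = spearmanr(severities, mean_confidence)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A well-calibrated model would show a strongly negative rho; a positive or near-zero
# value would echo the paper's finding of unreliable confidence under corruption.
```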

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the general idea of the paper is good, I think the methodology is not sufficient to support the claims. It is not surprising that corrupting images severely leads to a reduction in recognition, and that cannot be attributed to model robustness issues. I think that model robustness amid realistic perturbations is an issue, yet the methodology is insufficient to investigate it, as also reflected by the highly variable results in the paper.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes to benchmark DNN models on corrupted images and investigate their robustness. The reviewers raise concerns about the experimental setup and details, the discussion of results, and reproducibility, since neither the code nor the local dataset is provided. We invite the authors to carefully address these concerns in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7




Author Feedback

Dear reviewers, in this rebuttal we first offer supplements for common questions about reproducibility. Then, the remaining major concerns are clarified point by point.

[Common question] Q: Reproducibility. A: According to the rebuttal guidelines, external links cannot be provided. Hence, we will publish the code if the paper is accepted. Regretfully, we cannot publish the local dataset due to lack of permission. Nonetheless, the results on the public dataset, PatchCamelyon, can be reproduced from the code.

[To Reviewer 1] Q: Missing code makes it hard to judge the true quality of the corruptions. A: In the manuscript, Fig. 1 and Fig. 2 give corrupted examples of different types and severity levels. The code and more examples will be published subsequently. Moreover, the manuscript also explores the correlation between performance on corrupted data and on unseen data (i.e., the test set), which indirectly illustrates that the proposed corruptions are close to reality. Q: Is a model trained with corrupted images more generalizable? A: Yes; as you suggested, training on corrupted images and testing on unseen images reduces the error from 17.00% to 10.53% with ResNet50 on PatchCamelyon. Q: Is the reverse relationship between severity and model confidence an expected result? A: Yes, intuitively a higher severity level should correspond to lower prediction confidence. Hence, we designed the CEC metric to investigate the robustness of confidence under corruptions. However, the CEC results show that existing models have unreliable confidence under corruptions.
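The augmentation experiment mentioned in this reply (training on corrupted images) is not detailed here. A minimal sketch of how such corruption augmentation could be wired into a training transform, reusing the hypothetical corruption functions sketched earlier (not the authors’ actual setup), might look like this:

```python
import random

class RandomCorruption:
    """Hypothetical augmentation: apply a randomly chosen corruption at a random severity."""
    def __init__(self, corruption_fns, severities=(1, 2, 3, 4, 5), p=0.5):
        self.fns = list(corruption_fns)
        self.severities = severities
        self.p = p

    def __call__(self, img):
        if random.random() < self.p:
            fn = random.choice(self.fns)
            img = fn(img, random.choice(self.severities))
        return img

# Usage sketch with torchvision-style transforms:
# train_transform = torchvision.transforms.Compose([
#     RandomCorruption([gaussian_blur, gaussian_noise]),
#     torchvision.transforms.ToTensor(),
# ])
```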

[To Reviewer 2] Q: Do existing robustness studies work on the proposed benchmarks? A: After revealing the poor corruption robustness, how to improve it is essential. As a next step, we will investigate existing studies or propose a new method to improve it.

[To Reviewer 3] Due to space limitations, here we only reply to the major comments stressed in both the weaknesses and the comments. Q: Model robustness to a covariate shift in the input is highly dependent on the training run; comparing single-shot trainings of various architectures is therefore very likely not conclusive for such investigations. A: The reported results in the manuscript are averaged over three runs, not a single run. Meanwhile, the standard deviations (std) over multiple training runs are small. For example, the Error, CE, rCE, and CEC for AlexNet on LocalTCT are 16.06+/-0.21, 30.64+/-0.52, 1.91+/-0.01, and 43.25+/-2.09, and for ResNet on PatchCamelyon are 10.78+/-1.03, 24.13+/-1.47, 2.24+/-0.21, and 44.09+/-1.60. The reasons for the low std include: 1) all models are evaluated on two large multi-center datasets (more than 30,000 validation samples each); 2) all results are averaged over the validation set under 45 corruptions (equivalent to enlarging the validation set by 45 times). Moreover, the experimental results on the two datasets consistently support our claims, so these claims are conclusive for the study of corruption robustness. Q: Is diagnostic information destroyed in corrupted images? A: Your comment is a good supplement to the design of the corruptions. Although the paper only stresses that the generated corruptions are close to real ones, we also considered the diagnostic information remaining in corrupted images when designing the corruption types and severities. Following your suggestion, 3 images with 45 corruptions each (135 images in total) from each of the two datasets were evaluated by experts; 118 PatchCamelyon and 127 LocalTCT patches could be correctly recognized. Hence, the majority of corrupted images still retain diagnostic information. Q: The sense of the rCE metric. A: The rCE measures the performance drop under corruptions. Investigating which components cause a higher rCE is essential for further reducing the CE of existing powerful models (i.e., models with a low Error).

Best regards, Paper1462 Authors




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes benchmarking DNN models on corrupted images and investigating the robustness. The authors target an important problem which is the model robustness. The topic is meaningful for the field and can inspire more studies toward developing robust models. As the work analyzes and evaluates various benchmarks, detailed settings and code should be provided to ensure reproducibility. It is suggested that the authors include more details about the experiments and publish the source code as promised when preparing the final version.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper presents a benchmarking study comparing various standard CNN models on histopathology image classification with corrupted images. After reading all reviews, the rebuttal, and the paper, I have to agree with R3 that the paper is much below the MICCAI standard. The lower results on corrupted images are expected, and making the code public isn’t a sufficient justification for acceptance. Most importantly, with corrupted images, a real contribution would be a method to enhance image quality and overcome the corruption. A similar paper was published at MICCAI 2020: Corruption-Robust Enhancement of Deep Neural Networks for Classification of Peripheral Blood Smear Images, which, however, included method development for improving robustness and hence presented a greater contribution.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes to benchmark CNN and transformer models on corrupted images to investigate model robustness and also proposes several new metrics to assess robustness. All reviewers and meta-reviewers agree on the motivation of such a study yet have different opinions on the conclusions and the ‘add-on’ value of the current paper. I understand the point of R3 and MR3 that a better contribution could be a method to enhance image quality and overcome the corruption, but I still acknowledge the paper’s contribution of a few metrics to assess the corruption problem. Since the authors propose to release the code, it could also help others assess other models’ performance under corruption and obtain a more realistic measure of model generalization in real clinical settings.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8


