
Authors

Gregory Holste, Ziyu Jiang, Ajay Jaiswal, Maria Hanna, Shlomo Minkowitz, Alan C. Legasto, Joanna G. Escalon, Sharon Steinberger, Mark Bittman, Thomas C. Shen, Ying Ding, Ronald M. Summers, George Shih, Yifan Peng, Zhangyang Wang

Abstract

Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance. However, the nuanced ways in which pruning impacts model behavior are not well understood, particularly for long-tailed, multi-label datasets commonly found in clinical settings. This knowledge gap could have dangerous implications when deploying a pruned model for diagnosis, where unexpected model behavior could impact patient well-being. To fill this gap, we perform the first analysis of pruning’s effect on neural networks trained to diagnose thorax diseases from chest X-rays (CXRs). On two large CXR datasets, we examine which diseases are most affected by pruning and characterize class “forgettability” based on disease frequency and co-occurrence behavior. Further, we identify individual CXRs where uncompressed and heavily pruned models disagree, known as pruning-identified exemplars (PIEs), and conduct a human reader study to evaluate their unifying qualities. We find that radiologists perceive PIEs as having more label noise, lower image quality, and higher diagnosis difficulty. This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification. All code, model weights, and data access instructions can be found at https://github.com/VITA-Group/PruneCXR.
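
The pruning setup described above (L1 magnitude pruning of a convolutional CXR classifier, swept over many sparsity ratios) can be illustrated with a minimal PyTorch sketch. This is not the authors' released code: global L1 unstructured pruning is assumed from the setup the reviews describe, and `evaluate_per_class_ap` is a hypothetical placeholder for a per-class average-precision evaluation.

```python
import copy

import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet50

def prune_global_l1(model: torch.nn.Module, amount: float) -> torch.nn.Module:
    """Globally zero out the `amount` fraction of smallest-magnitude conv/linear weights."""
    params = [
        (m, "weight")
        for m in model.modules()
        if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))
    ]
    prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=amount)
    for module, name in params:
        prune.remove(module, name)  # bake the binary masks into the weight tensors
    return model

dense = resnet50(num_classes=14)  # e.g., one output per thorax disease label
# dense.load_state_dict(torch.load("trained_cxr_model.pt"))  # hypothetical checkpoint

for k in [round(0.05 * i, 2) for i in range(20)]:  # k = 0.0, 0.05, ..., 0.95
    sparse = prune_global_l1(copy.deepcopy(dense), amount=k)
    # per_class_ap = evaluate_per_class_ap(sparse, test_loader)  # hypothetical helper
```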

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_64

SharedIt: https://rdcu.be/dnwIa

Link to the code repository

https://github.com/VITA-Group/PruneCXR

Link to the dataset(s)

https://nihcc.app.box.com/v/ChestXray-NIHCC

https://physionet.org/content/mimic-cxr/2.0.0/


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper studies the effects of model pruning on a relevant long-tailed, multi-label medical classification problem. Evaluation methods inspired by the work of Hooker et al. (ref. [13]) are applied here to CXR diagnosis. While pruning can entail beneficial effects, the cost-benefit balance has not yet been well investigated in settings characterized by severe class imbalance or class co-occurrence.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Findings indicate that, in general, rare diseases are forgotten earlier and are more severely impacted at high sparsity, while the more two diseases co-occur, the more similar their forgetting trajectories are across all sparsity ratios. Radiologists perceived that CXRs on which an uncompressed and a heavily pruned model disagree had more label noise, lower image quality, and higher diagnosis difficulty.
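
    The co-occurrence finding can be made concrete with a small illustrative sketch (synthetic inputs, not the authors' data or exact analysis): over all class pairs, correlate label co-occurrence with the similarity of the classes' forgetting trajectories.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical inputs (random stand-ins for the paper's measurements):
# curves[i]      = forgettability curve of class i over the 20 sparsity ratios
# cooccur[i, j]  = how often diseases i and j are labeled together
rng = np.random.default_rng(0)
n_classes, n_ratios = 14, 20
curves = rng.random((n_classes, n_ratios))
cooccur = rng.random((n_classes, n_classes))

pairs_sim, pairs_co = [], []
for i in range(n_classes):
    for j in range(i + 1, n_classes):
        # Trajectory similarity as negative Euclidean distance between curves
        pairs_sim.append(-np.linalg.norm(curves[i] - curves[j]))
        pairs_co.append(cooccur[i, j])

rho, p = spearmanr(pairs_co, pairs_sim)
print(f"Spearman rho between co-occurrence and trajectory similarity: {rho:.2f} (p={p:.3f})")
```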

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Only a specific DNN and a baseline pruning approach have been tested; slightly different or richer observations and conclusions might be derived from an extended study.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors declare that labels and data splits will be made public upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I have no specific comments; the paper is clear enough. In case additional pages are allowed, I would expand the methodological description, which is somewhat minimal.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study findings are interesting and contribute to improving the general understanding of deep learning models in deployment scenarios where the cost-benefit trade-off of pruning must be evaluated.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present an in-depth study on the effects of pruning on long-tailed and multi-label medical image classifiers. In detail, the study reports a thorough statistical analysis of the performance loss of a ResNet-50 at various L1 pruning ratios, i.e., from 0% up to 95% of the model in increments of 5 percentage points, for a total of 20 different pruned models. Furthermore, to ensure statistical validity, the authors perform this evaluation with 30 unique random initializations on each of the two datasets, resulting in a total of 60 training runs. Differently from the existing literature, and in particular the work by Hooker et al. [13], the authors present an analysis of medical images (CXRs), which can be beneficial for understanding which diseases tend to be forgotten by a model more quickly and might help the development and design of future models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clear experimentation explanation: the authors present their study extremely clearly and provide many insights and observations on the obtained results, which are particularly useful in a statistical analysis. In detail, the authors unfold the effects of pruning on the classification of (rare) diseases by answering four questions about this procedure and manage to find: (i) sparsity thresholds after which there is a consistent performance drop, helpful when pruning a model; (ii) indication that class frequency directly affects performance after pruning, which is an expected result but worth confirming; (iii) a correlation between class frequency and co-occurrence with respect to what is defined as “forgettability” behavior, which can provide cues for improved model designs; and (iv) a definition of PIEs applied to medical images, which turn out to contain multiple and/or rare diseases and can be used to gain insight for new data collections (see the sketch below).

    Insightful graphics: the reported figures clearly show the effects of pruning through several statistics and allow the authors to provide relevant observations on the analyzed dataset. For instance, the authors highlight differences between PIE and non-PIE images.

    Validation by expert cohort: the study concludes with a validation performed by a cohort of experts who provide opinions on the labels, image quality, and diagnosis difficulty of PIE vs. non-PIE images, clearly showing their differences.
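
    As an illustration of the PIE idea mentioned above, the following is a simplified sketch for a multi-label setting: an image is flagged when the thresholded predictions of the uncompressed and heavily pruned models disagree on any class. The paper's exact criterion aggregates predictions over many independent runs; this stand-alone version is an assumption made for clarity.

```python
import numpy as np

def find_pies(dense_probs: np.ndarray, sparse_probs: np.ndarray,
              thresh: float = 0.5) -> np.ndarray:
    """Return indices of pruning-identified exemplars (PIEs).

    dense_probs, sparse_probs: (n_images, n_classes) predicted probabilities
    from the uncompressed and the heavily pruned model. An image is a PIE
    here if the two thresholded label sets disagree on any class
    (simplified: the paper aggregates over many training runs).
    """
    disagree = (dense_probs >= thresh) != (sparse_probs >= thresh)
    return np.flatnonzero(disagree.any(axis=1))

# Toy usage with random probabilities for 100 images and 14 labels:
rng = np.random.default_rng(0)
pie_idx = find_pies(rng.random((100, 14)), rng.random((100, 14)))
```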

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Single case study: although the authors mention these aspects in their work, as they are important flaws, all experiments were performed using a single model (i.e., ResNet-50) and a single pruning strategy (i.e., L1). This is the main weakness in an otherwise well-thought-out statistical analysis, since it is possible that the results could change entirely when exploring different pruning strategies or models. Regardless of the general implications such experiments might have, the performed tests remain sound and valid. However, they may hold only for these specific settings, indicating that further inquiries are certainly required.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors disclose all parameters needed to reproduce their statistical study. While the dataset split used for their experimentation, required to fully reproduce their results, is not yet available, the paper mentions that it will be made available upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors presented an in-depth statistical analysis of the effects of pruning for long-tailed, multi-label medical image classification. While easy to follow and full of insights, there are some aspects that can still be improved. For instance, the first equation could be simplified and made clearer, as it accounts for the sparsity ratios k, which are instead written out explicitly. Specifically, the forgettability curve can be parametrized using the already-defined quantities (i.e., k and i). Moreover, since experiments are also performed without pruning, and the forgettability curve requires the AP of the unpruned model to be computed, k should include the value 0 as well. Finally, equations should be numbered so they are easier to follow throughout the presentation.
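
    For concreteness, one plausible parametrization along the lines this comment suggests (an assumption; the paper's exact definition may differ) indexes the curve by class i and sparsity ratio k, with k = 0 recovering the unpruned model:

```latex
% Hypothetical forgettability-curve parametrization, where AP_i(k) is the
% average precision of class i at sparsity ratio k (AP_i(0): unpruned model):
\[
  f_i(k) = \mathrm{AP}_i(0) - \mathrm{AP}_i(k),
  \qquad k \in \{0, 0.05, 0.10, \ldots, 0.95\},
\]
% so f_i(0) = 0 by construction, and larger f_i(k) indicates that class i
% is forgotten more severely at sparsity k.
```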

    As possible future work, especially since it was also stated as a limitation, it would be interesting to explore the effects of different pruning strategies on higher-performing models. For example, if a model starts from a higher AP than those reported (i.e., the ResNet-50 results), would it tolerate higher pruning ratios, or is the performance degradation strictly tied to the architecture? If so, are some architectures more robust to pruning than others? These are only some of the possible questions that would provide further information on the pruning strategy.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a statistical study enriched with many insights and observations. However, the obtained results refer to the specific settings described in the manuscript, which does not allow the formulation of more general observations without further experimentation. Regardless, the work has its merits, as it shows relevant results. Moreover, the authors acknowledge the issue noted above and provide their insights fully conscious of this limitation, which is set to be addressed in future work.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper analyzes pruning’s effect on neural networks trained to diagnose thorax diseases from chest X-rays (CXRs). The goal is to understand the differential impact of pruning on model behavior in the unique setting of deep long-tailed, multi-label medical image classification.


  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very well written paper. Results are very clearly communicated.
    • The authors designed a study around relevant questions about pruning and also included a human reader study to corroborate their findings.
    • Research questions are relevant and satisfactorily answered. 

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I don’t think ResNet50 is a good model choice for this work; for CXR, mostly DenseNet121 or related architectures are used.
    • Also, the study only includes a single model; it would be much more beneficial to include other model architectures.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Ok.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See weaknesses above.

    Furthermore, regarding the Introduction’s mention of pruning for fairness: this paper is a suitable reference, since the authors specifically work with CXRs: https://proceedings.mlr.press/v182/marcinkevics22a.html

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a very well-conducted study and addresses a relevant question.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This is a solid paper presenting a new study that investigates the effects of pruning on the performance of classifiers for thoracic diseases in chest radiographs.




Author Feedback

We thank the Reviewers and Meta-Reviewer for their time and constructive feedback on our submission.

General Comments

We are grateful that all Reviewers agree that this paper offers meaningful insights into the impact of pruning on medical image classifiers. Reviewers also acknowledge that our findings might be strengthened by studying multiple models and pruning methods, as we explicitly note in our Discussion section. We fully agree that expanding this study to include other model architectures, training methods, model compression methods (even beyond pruning), and different tasks and modalities beyond chest X-ray classification would enhance the applicability of our findings. Given the space constraints posed by the proceedings template, our aim was to conduct a comprehensive, fine-grained analysis of one popular pruning method and model architecture. While exploring how these findings vary across models and compression techniques was not our primary objective, these are undoubtedly intriguing questions worthy of future study. We would like to emphasize that all code, data (including new labels), and trained models will be publicly released in the coming months, ensuring that others in the community can easily reproduce and expand upon our work.

Reviewer #1

In the camera-ready version, we will add more details to the Methods section. In addition, while many key implementation details can be found in the Supplementary Materials, we will try to incorporate them into the main text.

Reviewer #2

We greatly appreciate the concrete suggestions to (i) more concisely express the forgettability curve, (ii) number our equations, and (iii) include 0 in our list of sparsity ratios. We will make all proposed changes in the camera-ready version.

Regarding how forgettability curves might vary with model architectures more performant (i.e., likely larger) than a ResNet50, we believe this question can be partially answered by referring to the existing literature on sparsity. There is a well-studied correlation between “compressibility” and model capacity – in general, larger and more expressive models can be pruned more “aggressively” than smaller ones [1,2]. However, we agree that it would be interesting to empirically verify these findings and further explore how the relationship to the multi-label and long-tailed nature of our problem might vary with model capacity.

Reviewer #3

We appreciate the suggestion to add [3] to our Introduction and will update this in our camera-ready version.

References

[1] Li, Zhuohan, et al. “Train big, then compress: Rethinking model size for efficient training and inference of transformers.” International Conference on Machine Learning. PMLR, 2020.

[2] Zhu, Michael, and Suyog Gupta. “To prune, or not to prune: Exploring the efficacy of pruning for model compression.” arXiv preprint arXiv:1710.01878 (2017).

[3] Marcinkevics, Ricards, Ece Ozkan, and Julia E. Vogt. “Debiasing deep chest X-ray classifiers using intra- and post-processing methods.” Machine Learning for Healthcare Conference. PMLR, 2022.


