
Authors

Dwarikanath Mahapatra, Antonio Jose Jimeno Yepes, Shiba Kuanar, Sudipta Roy, Behzad Bozorgtabar, Mauricio Reyes, Zongyuan Ge

Abstract

The robustness of medical image classification models is limited by their exposure to the candidate disease classes. Generalized zero-shot learning (GZSL) aims at correctly predicting both seen and unseen classes, and most current GZSL approaches have focused on the single-label case. However, it is common for chest x-rays to be labelled with multiple disease classes. We propose a novel multi-label GZSL approach using: 1) class-specific feature disentanglement and 2) semantic relationships between disease labels distilled from BERT models pre-trained on biomedical literature. We learn a dictionary from the distilled text embeddings and leverage it to synthesize feature vectors that are representative of multi-label samples. Compared to existing methods, our approach does not require class attribute vectors, which are an essential part of GZSL methods for natural images but are not available for medical images. Our approach outperforms state-of-the-art GZSL methods for chest x-ray images.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_26

SharedIt: https://rdcu.be/dnwym

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The article proposes a novel multi-label generalized zero shot learning (GZSL) approach for medical image classification problems, specifically for chest x-rays, where images are commonly labeled with multiple disease classes. The approach uses class-specific feature disentanglement and semantic relationships between disease labels distilled from BERT models pre-trained on biomedical literature to synthesize feature vectors representative of multi-label samples. Unlike existing GZSL methods, the approach does not require class attribute vectors, and it outperforms state-of-the-art GZSL methods for chest x-ray images. The article also includes an ablation study that analyzes the performance of different loss terms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper proposes a feature disentanglement method to improve feature learning and synthesis in multi-label scenarios by decomposing images into class-specific and class-agnostic features.
    • Text embedding similarities are used to learn semantic relationships between labels, guiding feature generation to preserve multi-label relationships. The paper applies this concept to synthesize unseen class features and perform classification for the GZSL problem.
    • The effectiveness of the proposed model on multi-label and single-label settings is compared on three datasets including NIH, CheXpert and PadChest.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Statistical evaluation of the results would help validate the significance of the proposed method.
    • Five classes are presented in the t-SNE plot to illustrate the minimal overlap between different clusters; specifying which classes from the NIH dataset were used would help evaluate and reproduce the results.
    • The proposed method shows results comparable to the fully supervised FSL (Multi Label) method; further implementation details on this baseline, compared with the ML-GZSL in Table 2, would help validate the results.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The datasets used are open datasets, and we encourage making the code public for reproducibility.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The proposed model is evaluated on three datasets including NIH, CheXpert, and PadChest in both multi-label and single-label settings. It would be interesting to compare the method on generic datasets.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel method with results that appear to outperform current state-of-the-art baselines. However, a statistical significance evaluation and further clarity on the classes used in the t-SNE plot would help evaluate the results further.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a generalized zero-shot learning method for multi-label classification of chest x-rays. The proposed method uses class specific feature disentanglement.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of decomposing feature space into class-specific and class-agnostic features is interesting.

    2. Some results are promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are major ambiguities in the description of the proposed method.

    2. The experimental settings and the evaluation protocol are unclear.

    3. Implementation details are missing.

    4. Experiments are insufficient and comparisons are inadequate. Some recent relevant methods have not been compared.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Major details related to the experimental protocol and implementation are missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. The proposed method strongly relies on decomposing the feature space. However, the process of decomposition is not clear. The authors should explain how to divide the latent space into class-specific and class-agnostic vectors. Are these done by two different layers?

    2. The definition of different symbols has not been provided. As a result, the definition of L_{Rec} is not clear to me. Also, the authors have not explained what class-specific autoencoder is. Is it a separate component? The authors should clarify these issues.

    3. The process of obtaining the z-vectors of equation (2) should be explained.

    4. It is not clear how to obtain the average class accuracies in multi-label setting. Furthermore, depending on the definition, average class accuracies may be strongly affected by the performance for the dominant class. Therefore, the authors should also compute performance metrics such as average F1 score and average AUROC.

    5. The authors should provide a detailed layer-wise description of the proposed architecture.

    6. The name of the seen and the unseen classes for each dataset should be stated. It is also necessary to clarify if the experiments have been conducted for different combinations of seen and unseen classes and the number of runs for each experiment.

    7. It is not clear if the same experimental setting has been followed for all competing methods. Also, the details of hyperparameter tuning need to be provided.

    8. Class-wise performances and image results should be presented.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is not clearly presented. The experimental protocol is ambiguous. The results are inadequate.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    1. Proposes a feature decoupling method to decompose images into known-label class features and unknown-label class features. 2. Uses the known-label class features to generate corresponding text features and to guide the fusion of the known-label and unknown-label class features, making full use of the unknown-label class features of the images for multi-label classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. By fusing the known-label and unknown-label class features of the images, the method makes full use of the unknown-label class features and improves the accuracy of zero-shot multi-label classification in medical imaging. 2. By clustering all the synthesized features, corresponding label class prototypes are formed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The synthesized features generated by clustering are not further processed to increase the distance between prototypes of different classes. 2. The text feature dictionary generated from the known label classes is not fully utilized.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    N/O

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    (1) The proposed approach may have been appropriate at the time of your research, but recent advancements in the field suggest that newer methods could yield more robust results.

    (2) The authors should add some advanced and recent work on multi-label GZSL methods.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some relevant works are missing.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a GZSL method for chest x-ray images, focusing on feature disentanglement. Three public datasets have been used for evaluation. While the method generally shows good performance, reviewers noted that there are more recent methods it should be compared with. Also, some details about the experimental studies are missing, which makes it difficult to evaluate the validity of the results.




Author Feedback

We thank all reviewers for their comments.

R1->statistical significance and classes: Student t-tests against the benchmark FSL methods show p=0.061 for our ML-GZSL, indicating similar performance to FSL; p<0.04 for the other methods indicates significantly different performance. The five classes presented in the t-SNE plot are Atelectasis, Consolidation, Effusion, Infiltration, and Nodule.
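The pooled two-sample Student t-test mentioned above can be sketched in plain Python; this is an illustration of the statistic, not the authors' evaluation code, and the per-run accuracy values below are hypothetical placeholders.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled (equal-variance) two-sample Student t-statistic."""
    na, nb = len(a), len(b)
    # Pooled sample variance over both groups
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical per-run accuracies for two methods (not the paper's numbers)
method_a = [92.8, 92.5, 93.1, 92.7, 92.9]
method_b = [91.1, 90.8, 91.4, 91.0, 91.2]
t = t_statistic(method_a, method_b)
```

In practice the p-values quoted in the rebuttal would come from comparing this statistic against the t-distribution, e.g. via `scipy.stats.ttest_ind`.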

R1->FSL implementation details: We use a DenseNet-121 trained for multi-label classification closely following the top-ranked method for CheXpert (https://arxiv.org/pdf/2012.03173.pdf), where the ranking is based on AUC. At the same time, we report the global Accuracy values to align with previous GZSL literature.

R1->Results on Generic datasets: Our method focuses on multi-label CXR classification and not many public medical image datasets have multi-label use cases. In future work, we will apply our method to other datasets in the single and multi-label settings.

R2->lack of clarity in feature decomposition and z-vectors: We apologize for the ambiguity. https://arxiv.org/abs/2007.00653 describes a similar concept for domain adaptation. Instead of decomposing into shape and texture, we decompose into class-specific and class-agnostic vectors using the final layer. The latent representation (i.e., the z-vectors) has two heads (instead of one in a conventional autoencoder) for the decomposed features. z is obtained by training class-specific autoencoders (Fig. 1a).

R2->Autoencoder definition: L_{Rec} is explained on p. 4 after Eq. (1) as the image reconstruction loss obtained by summing up the losses from all encoders. Class-specific autoencoders refer to autoencoders trained using images from a single class. We will add this clarification.

R2->AUC values: Thanks for the suggestion. Based on global TP, FP, TN, and FN counts we calculate global accuracy. AUC/F1 values for the CheXpert data are: FSL - 93.0/91.7, Ours - 92.8/91.6, [17] - 91.1/89.6, [9] - 84.3/82.4. Values for the other methods will be given in the revised version. The relative performance of the different methods is similar under F1, AUC, and Accuracy, showing the superiority of our method. The metrics are not affected by the dominant class ("no-findings") since it is typically not considered in evaluations for unseen class detection.
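The global accuracy and F1 described above (pooling TP/FP/TN/FN over all image-label pairs, i.e. micro-averaging) can be sketched as follows; this is an illustration of the metric definitions, not the authors' code, and the label names and predictions are hypothetical.

```python
def global_metrics(y_true, y_pred, labels):
    """Pool TP/FP/TN/FN over all (image, label) pairs, then derive metrics."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if label in truth and label in pred:
                tp += 1
            elif label not in truth and label in pred:
                fp += 1
            elif label in truth:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Hypothetical multi-label ground truth and predictions for three images
labels = ["Atelectasis", "Effusion", "Nodule"]
y_true = [{"Atelectasis"}, {"Atelectasis", "Effusion"}, {"Nodule"}]
y_pred = [{"Atelectasis"}, {"Effusion"}, {"Atelectasis", "Nodule"}]
acc, f1 = global_metrics(y_true, y_pred, labels)
```

Because every label of every image contributes one count, no single class dominates unless it is included in the label set, which matches the rebuttal's point about excluding "no-findings".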

R2->Architecture description: We use a conventional autoencoder with the following specifications: Input - 256x256 image. Encoder - 3 convolution layers (64, 32, 32 filters of 3x3), each followed by max pooling. Decoder - symmetric to the encoder. z^{agn} and z^{spec} are 256-dimensional vectors. We will give details in the final manuscript and the code.
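The encoder dimensions above can be traced with a small shape calculation; this sketch only checks spatial sizes, and it assumes "same" padding for the 3x3 convolutions and 2x2 max pooling, neither of which is stated in the rebuttal.

```python
def conv_same(h, w):
    """3x3 convolution with 'same' padding keeps the spatial size (assumed)."""
    return h, w

def maxpool2(h, w):
    """2x2 max pooling halves each spatial dimension (assumed)."""
    return h // 2, w // 2

h, w = 256, 256  # input image size stated in the rebuttal
for _ in range(3):  # encoder: 3 conv layers, each followed by max pooling
    h, w = maxpool2(*conv_same(h, w))

# Bottleneck feature map: 32 channels at the final spatial resolution,
# projected into two 256-dimensional heads z^{spec} and z^{agn}
flat = h * w * 32
z_spec_dim = z_agn_dim = 256
```

Under these assumptions the encoder output is a 32x32x32 feature map (32768 values) before the two latent heads; a symmetric decoder would reverse the same stages.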

R2->clarification on combinations of seen and unseen classes: We thank R2 for pointing this out. Different combinations of 7 seen and unseen classes are taken, and for each combination we run our model 5 times. The final reported numbers are the average over all these combinations. The same experimental setting (i.e., model runs, combinations of seen and unseen classes, etc.) has been followed for all competing methods. The hyperparameter selection steps are described on p. 7. Additional class-wise results may be included subject to space constraints. We report metrics in the same way as previous GZSL papers.

R2+R3->comparison with recent methods: We have compared with recent methods for the single-label ([7,13,20] - CVPR 2022, [17] - TMI 2022) and multi-label ([9] - ML4H 2021, [10] - CVPR 2020, [15] - CVPR 2018) settings. We would appreciate it if specific references to compare against could be provided. There are very few works on single-label GZSL for medical images, and hardly any for multi-label GZSL. We refrained from adding computer vision works and focused on medical applications since the training dataset sizes, assumptions on class-attribute vectors, and task complexity are different.

R3->optimal use of synthesized features: We clarify that we do aim to increase the distance between prototypes of different classes, as indicated in Eqn. (2). The text features have been optimally utilized to obtain class centroids and synthesize features.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    While the rebuttal has addressed some of the comments, the important questions about experimental setups (e.g., seen and unseen classes) and comparison with more recent methods remain. Especially for the zero-shot scenarios, the results could be quite sensitive to the dataset and settings.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a multi-label zero-shot classification method for CXR. Most of the comments are about clarification, comparison with baselines, and citation of more recent methods. The authors declined to compare against the latest methods, arguing that those methods focus on computer vision applications. While this argument would be valid for a method designed specifically for the domain, I do not find anything medical-imaging specific in this application; in other words, the proposed method is an example of a generic approach. The use of t-SNE embeddings is only good for visualization and should not be viewed as a quantitative comparison, since t-SNE is very sensitive to its parameters.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes zero shot learning of multi-label data. A strength is that the model integrates knowledge obtained in a large language model trained on medical literature. However, the reviewers raise several concerns regarding the comparison to baselines and details about the experimental studies. In its current form, the work does not meet the level required for MICCAI.


