
Authors

Yiming Lei, Zilong Li, Yan Shen, Junping Zhang, Hongming Shan

Abstract

Lung nodule malignancy prediction has been enhanced by advanced deep-learning techniques and effective tricks. Nevertheless, current methods are mainly trained with cross-entropy loss using one-hot categorical labels, which results in difficulty in distinguishing those nodules with closer progression labels. Interestingly, we observe that clinical text information annotated by radiologists provides us with discriminative knowledge to identify challenging samples. Drawing on the capability of the contrastive language-image pre-training (CLIP) model to learn generalized visual representations from text annotations, in this paper, we propose CLIP-Lung, a textual knowledge-guided framework for lung nodule malignancy prediction. First, CLIP-Lung introduces both class and attribute annotations into the training of the lung nodule classifier without any additional overheads in inference. Second, we design a channel-wise conditional prompt (CCP) module to establish consistent relationships between learnable context prompts and specific feature maps. Third, we align image features with both class and attribute features via contrastive learning, rectifying false positives and false negatives in latent space. Experimental results on the benchmark LIDC-IDRI dataset demonstrate the superiority of CLIP-Lung, in both classification performance and interpretability of attention maps. Source code is available at https://github.com/ymLeiFDU/CLIP-Lung.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_38

SharedIt: https://rdcu.be/dnwLT

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    They proposed CLIP-Lung, a framework using textual knowledge as guidance for effective lung nodule malignancy prediction.

    According to the paper, the clinical text information annotated by radiologists contributes substantially to differentiating challenging samples whose labels are close to each other. Most lung nodule malignancy prediction methods are trained with cross-entropy loss on one-hot categorical labels. However, this makes it difficult to distinguish nodules whose progression labels lie close together.

    Based on this observation, CLIP-Lung has three stages: it uses both class and attribute annotations; it has a channel-wise conditional prompt (CCP) module; and it aligns the image features with both class and attribute features.

    CLIP-Lung is tested on the LIDC-IDRI dataset and obtains better results in terms of both classification performance and interpretability of attention maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strongest part of the paper is that it is very well designed and organized. Since the terms used in the paper are well explained and illustrated, even readers who are not well versed in this subject can easily understand and learn the method while reading the paper.

    The proposed method has a novel architecture and an excellent structure that combines radiological images and textual data, which has not been explored before. The design is very logical: using not only the nodule images as input but also the text information has an excellent impact on the model and results in high accuracy. This kind of new method is open to further development.

    Another strength is that the evaluation contains many detailed comparisons across datasets, loss functions, and models. Experiments were carried out many times on data subsets with different class compositions for the classification tasks, namely benign, malignant, and unsure. The differences between them were recorded in detail. For instance, while one setting uses all three classes in the training, validation, and testing data, other sub-datasets contain different combinations of the classes, and experiments were run separately on these samples. In this way, many different test cases were created and informative test results were obtained.

    The illustrated explanations and examples were clear and comprehensible, as the information was presented to the reader in pieces. Also, the narrative sequence was organized to be read step by step. Therefore, readers can understand it even without prior knowledge of this field.

    Although the model has three stages and could therefore be complicated to understand, for the reasons mentioned above the processes are coherent and intelligible. The formulas are well written and explained in great detail.

    Furthermore, three different loss functions are used in different combinations during the experiments. Consequently, the effects of a single loss function or of combinations of loss functions can be seen clearly.

    To recap, CLIP-Lung has three stages: it uses both class and attribute annotations during the training of the lung nodule classifier; it has a channel-wise conditional prompt (CCP) module to establish proper connections between learnable context prompts and specific feature maps; and it aligns the image features with class and attribute features via contrastive learning, rectifying false positives and false negatives. Presenting these pieces of information step by step, rather than stating everything at once, makes the processes easy to comprehend.

    Lastly, the method is open to new development and has high potential, as it extends into a new branch of the visual-text information field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The evaluation uses only one dataset, LIDC-IDRI. Including other datasets would help reveal how the method behaves under different conditions. The abstract could also report the test results in numbers to better convey the model’s effectiveness.

    While the paper uses textual knowledge to enhance prediction accuracy, it does not sufficiently explain how this knowledge is integrated into the model, making it difficult for other researchers to reproduce or adapt the approach. For instance, instead of directly listing the text attributes, the paper should explain why these attributes are effective for identifying nodules. These examples could also be elaborated with the meaning of attributes such as spiculation and subtlety and why they are effective in determining the classes.

    The paper implies that the contrastive learning model can effectively shorten the distances between positive pairs and increase the distances between negative ones. It should be explained more clearly why the distance between positive and negative pairs is critical.

    Parameters and metrics related to training and evaluation beyond accuracy, recall, and F1 score could also be reported.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Since the data is public and the tests use only one dataset, the tests can be re-implemented and diversified by adding different datasets. Furthermore, the model can be developed and varied using different text attributes. Since the information or data sources can be extended by building visual representations from text annotations, similar techniques can be developed for other fields.

    However, since the model code is private, the reproducibility and reusability of the method are significantly affected. It will be challenging to reproduce all of the reported results, since the model was tested with many subsets of the same data and all combinations of the different loss functions.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The paper proposes a new CLIP-Lung framework for effective lung nodule malignancy prediction. The framework uses textual knowledge annotated by radiologists as guidance, including attributes such as spiculation and subtlety. The authors state that most lung nodule malignancy prediction methods use cross-entropy loss functions with one-hot categorical labels. However, this results in difficulty distinguishing nodules whose progression labels are close to each other. To address this, CLIP-Lung has three stages. First, it uses both class and attribute annotations. Second, the model has a channel-wise conditional prompt (CCP) module to connect learnable context prompts with specific feature maps. Third, it aligns image features with class and attribute features via contrastive learning to rectify false positives and negatives. CLIP-Lung is tested on the LIDC-IDRI dataset and compared to other methods. The results show that CLIP-Lung performs better in terms of both classification performance and interpretability of attention maps.

    The paper is well-organized and well-explained, making it easy for readers to understand the method, even if unfamiliar with the field.

    One of the paper’s strengths is the detailed comparisons between different datasets, loss functions, and models during evaluation. Experiments were carried out many times on data subsets with different classes, namely benign, malignant, and unsure. Their differences were recorded in detail, and practical test results were obtained. The explanations and examples were clear and comprehensible, as the information was presented to the reader in pieces, and the narrative sequence was organized to be read step by step.

    While the formulas are well-written and explained in detail, providing additional information could make the process more understandable. For instance, the paper implies that the contrastive learning model can effectively shorten the distances between positive pairs and increase the distances between negative ones. However, since the model has three stages, it could be complicated to understand. It would be helpful to explain more clearly why the distance between positive and negative pairs is essential. Additionally, including more parameters and metrics related to training and inferencing, beyond just accuracy, recall, and F1 score, would provide a complete understanding of the method’s effectiveness.

    This method is open to development and has a high potential as it extends to a new branch of the visual text information field. The model’s effectiveness can be further increased by adding new and different text attributes and having more diverse textual knowledge. Ultimately, the model could suggest the most appropriate terms for clinical evaluations while finding the classes that nodules belong to.

    However, one limitation of the paper is that the evaluation only uses one dataset, LIDC-IDRI. Including other datasets would better demonstrate the model’s effectiveness in different contexts. The paper could also benefit from more information in the abstract section covering test results in numbers. Instead of directly listing the text attributes, explaining why these attributes effectively identify nodules would be helpful. Additionally, these examples could be explained more in depth, along with the meaning of attributes like spiculation and subtlety and why they effectively determine the type of classes.

    Finally, the authors did not make the model code publicly available, which limits the reproducibility and reusability of the method. While the data is public and the tests can be re-implemented and diversified by adding different datasets, it will be challenging to reproduce the exact results, since the model was tested with many different subsets of the same data and all combinations of different loss functions.

    In conclusion, CLIP-Lung is a promising framework for lung nodule malignancy prediction that has the potential to be further developed and improved. The paper is well-designed and organized, making it easy to understand, and the experiments are comprehensive. However, some aspects of the paper could be explained more clearly, and the lack of publicly available code may impact its reproducibility and reusability.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although some of the terms used could have been explained in more detail, this does not make the paper difficult to fully understand for people who are well informed about the subject. As the methods are read and followed, the fact that the explanations and examples appear in the correct order greatly increases the comprehensibility of the concepts. Therefore, this paper is well organized.

    The formulas are written clearly, and the parameters are clearly defined. Visual explanations such as graphs, tables, and images were clear and understandable. There was much information about the experimental results, including many different combinations of subsets in testing. It would be even better if the key text attributes and the reason why the distances between nodules are so important were described in more detail.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    7

  • [Post rebuttal] Please justify your decision

    I had a few minor reservations regarding specific explanations; however, they did not significantly detract from the overall quality of the paper. It would have been beneficial if those specific sections had been emphasized more prominently. Nevertheless, I found the paper to be commendable.



Review #2

  • Please describe the contribution of the paper

    This paper presents a novel framework, CLIP-Lung, for predicting the malignancy of lung nodules using textual knowledge as a guide. CLIP-Lung incorporates class and attribute annotations into the training of the lung nodule classifier and introduces a channel-wise conditional prompt (CCP) module to establish consistent relationships between context prompts and specific feature maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and organized.
    2. The use of CLIP-related technologies in the analysis of lung nodules is novel.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In the abstract and introduction, the authors describe semantic attributes like subtlety, sphericity, margin, and lobulation as “text information”. These attributes are more accurately labeled as “semantic attribute level information”. It may be worthwhile to consider whether a contrastive language-image pre-training model or vision-language model is necessary for this specific task.
    2. Texture, spiculation, and lobulation are important features for predicting malignancy, and it would be useful for the authors to compare their method with traditional attribute-fusion models to demonstrate the superiority of the CLIP-related approach in this specific task. At the very least, the authors should include some studies focused on the malignancy prediction task, since none of the studies mentioned in Table 1 are related to this task.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The technologies used in this paper have good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Lung nodule malignancy prediction is a classic task in CAD systems, closely related to medical features such as texture and spiculation. It is reasonable to fuse these features to improve malignancy prediction. However, the task itself is not very complex, and the use of new technologies alone does not necessarily improve clinical application. In this study, the authors applied new CLIP-related technologies to the task, but did not explain why this approach is better than traditional deep learning methods. Additionally, the studies mentioned in Tables 1 and 2 are not directly related to nodule malignancy prediction, which makes the results less convincing.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The necessity of adopting this technology has not been explained.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposed a textual knowledge-guided framework for lung nodule malignancy prediction, with a channel-wise conditional prompt module to allow nodule descriptions to guide the generation of informative feature maps and a textual knowledge-guided contrastive learning. A public dataset is used for validation. Visual interpretations are presented by t-SNE and Grad-CAM.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Vivid interpretation by t-SNE and Grad-CAM
    • A new channel-wise conditional prompt (CCP) module is proposed based on CoCoOp.
    • A textual knowledge-guided contrastive learning scheme is proposed for this contrastive-learning-based study.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Though an ablation study is performed to validate the effectiveness of the different loss functions, the results of the proposed model with I, A, C, I+C, and I+A as input, respectively, are not provided.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper can be reproduced to some extent.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    1) It is suggested to provide justification for why using L_CA + L_IC decreased the results of L_IC on all datasets. Using L_IC + L_IA + L_CA only marginally improved the performance over using L_IC and L_IA. Is it necessary to use L_CA? What is the performance of using L_CA alone?
    2) It is suggested to make the code publicly available after acceptance.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The contribution of L_CA is questionable and unclear which makes the three branch model not convincing.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    My concerns have been addressed. The explanations should be added to the paper.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers have acknowledged the work’s strengths but have identified some significant shortcomings. Specifically, Reviewer 1 expressed concern about the limited dataset, the lack of a clear description of how the knowledge is integrated into the model, and some experimental details. Reviewer 2 expressed concern about the motivation of this paper. Reviewer 3 questions the contribution of L_CA. In light of these comments, I kindly ask the authors to address these issues and the reviewers’ concerns in the rebuttal.




Author Feedback

We thank the three reviewers and the meta-reviewer for their valuable feedback and insightful suggestions. Overall, all reviewers appreciate the novelty of this paper and the well-written and well-organized manuscript.

R#1: First, LIDC-IDRI is the most widely used benchmark dataset for lung nodule classification, and in our experiments three constructed subsets are used to evaluate different sets of ranks. We will further evaluate CLIP-Lung on other lesion progression datasets in the extended journal version.

Second, we clarify in detail how the textual knowledge is integrated into the model. Each lung nodule is annotated with scores for the attribute texts; the latent semantic knowledge conveyed by these clinician-favored raw texts is then obtained with the text encoder. Furthermore, we use score normalization to obtain instance-specific attribute knowledge. Contrastive learning is then applied to pull semantically closer samples together and push semantically farther ones apart, based on the distances or similarities among positive and negative sample pairs. This is why the distance between positive and negative pairs is crucial.
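
To make the role of these pairwise distances concrete, the following is a minimal PyTorch sketch of a symmetric, CLIP-style contrastive (InfoNCE) loss that pulls matched image-text pairs together and pushes mismatched pairs apart. The function name, temperature value, and batching convention are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats, text_feats, temperature=0.07):
    # Illustrative sketch only, not the CLIP-Lung code.
    # image_feats, text_feats: (B, D) embeddings; row i of each tensor forms
    # a positive pair, and all other rows in the batch serve as negatives.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(image_feats.size(0), device=logits.device)
    # Cross-entropy over rows and columns maximizes the diagonal (positive pairs)
    # relative to the off-diagonal entries (negative pairs).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Under such a loss, a positive pair can only be pulled together relative to the negatives in the batch, which is why the separation between positive and negative pairs directly shapes the learned latent space.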

R#2: First, the main obstacle in current lung nodule malignancy prediction is still reducing false positives and false negatives, which is highly correlated with the progression labels. However, conventional categorical-label-based classification methods and ordinal regression methods are weak at modeling the ordinal information among adjacent rank labels. Fortunately, clinician-focused text descriptions can provide more fine-grained and distinctive information. Hence, utilizing contrastive vision-language models is effective for rectifying the biased visual space obtained by specific tasks or models. Note that CLIP-Lung applies text-guided contrastive learning only in the training phase; it does NOT introduce any computational overhead to the inference phase. Therefore, the trained image classifier remains fast at inference while improving prediction accuracy.

Second, this work focuses on learning progression labels, and the key motivation is that nodules with adjacent ordinal ranks are prone to be misclassified, as shown in Fig. 1. Therefore, we compare against medically compatible ordinal regression methods to demonstrate the strength of incorporating textual knowledge in helping the model distinguish adjacent ordinal labels, as illustrated in Fig. 3. The comparison with CLIP and CoCoOp demonstrates the superiority of our CCP module in prompt learning. Because CLIP-Lung can be flexibly combined with other methods by injecting text information into the latent space of the target vision model, we implemented MV-DAR (TMI 2022) with and without the textual branch of CLIP-Lung; the obtained accuracies on LIDC-A are 58.9% and 57.3%, respectively, demonstrating the effectiveness of integrating textual knowledge.

R#3: First, L_CA tends to align textual attributes with class-aware prompts, which does NOT affect the learning of the image encoder and classifier. Therefore, the results of using L_CA alone are equivalent to those obtained with L_CE in Tables 1 and 2.

Second, the purpose of CLIP-Lung is classification, and L_IC provides direct image-class alignment in latent space. In contrast, L_IA aligns class-agnostic attribute texts with local visual features, which carries NO class information and can mislead the image features. Hence, we conduct class-attribute (CA) alignment to endow the attribute texts with class information, and this works well in enhancing the performance of L_IC + L_IA. When we compare L_IC + L_IA with L_CA + L_IA, the latter performs worse than the former, which demonstrates that the direct IC alignment is superior to the indirect CA alignment w.r.t. image classification. Consequently, L_CA makes the attribute knowledge sensitive to class information and benefits classification performance.
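
As a rough illustration of how the three alignment terms could be combined with the classification objective, the sketch below uses a generic InfoNCE helper; the loss weights, helper name, and exact formulation are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def nce(a, b, temperature=0.07):
    # Generic symmetric InfoNCE alignment between two batches of embeddings.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    y = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits.t(), y))

def total_loss(cls_logits, labels, img_f, cls_f, attr_f, w_ic=1.0, w_ia=1.0, w_ca=1.0):
    # Assumed overall objective: classification plus the three alignments
    # discussed above (image-class, image-attribute, class-attribute).
    l_ce = F.cross_entropy(cls_logits, labels)
    l_ic = nce(img_f, cls_f)    # direct image-class alignment
    l_ia = nce(img_f, attr_f)   # image-attribute alignment (class-agnostic on its own)
    l_ca = nce(cls_f, attr_f)   # endows attribute features with class information
    return l_ce + w_ic * l_ic + w_ia * l_ia + w_ca * l_ca
```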

MR: We have addressed all the concerns of three reviewers and will make our source code publicly available after acceptance.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors’ rebuttal has proficiently addressed the raised concerns surrounding the dataset, the clarification of textual knowledge, as well as the motivation and contribution of L_CA. I believe the application of textual knowledge as a guiding beacon for enhancing the efficacy of lung nodule malignancy prediction to be of substantial value. I recommend acceptance of the paper. The authors should include the discussions presented in the rebuttal in the final version if accepted.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors proposed a new framework using textual knowledge as guidance for lung nodule prediction. The paper is well-written and includes interesting analysis. Previous concerns of reviewers have been addressed during the rebuttal. I would suggest ‘Accept’.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper studies the problem of lung nodule malignancy prediction by using a textual knowledge-guided method with a channel-wise conditional prompt module to allow nodule descriptions to guide the generation of informative feature maps and a textual knowledge-guided contrastive learning. The proposed method is validated on a public dataset.

    While R#2 raised several concerns regarding the paper, the authors have addressed them in the rebuttal. The meta-reviewer agrees with the other two reviewers that the proposed method does have some technical contribution and that the paper is generally well written. Accept.


