
Authors

Constantin Seibold, Simon Reiß, M. Saquib Sarfraz, Rainer Stiefelhagen, Jens Kleesiek

Abstract

When reading images, radiologists generate text reports describing the findings therein. Current state-of-the-art computer-aided diagnosis tools utilize a fixed set of predefined categories automatically extracted from these medical reports for training. This form of supervision limits the potential usage of models as they are unable to pick up on anomalies outside of their predefined set, thus, making it a necessity to retrain the classifier with additional data when faced with novel classes. In contrast, we investigate direct text supervision to break away from this closed set assumption. By doing so, we avoid noisy label extraction via text classifiers and incorporate more contextual information. We employ a contrastive global-local dual-encoder architecture to learn concepts directly from unstructured medical reports while maintaining its ability to perform free form classification. We investigate relevant properties of open set recognition for radiological data and propose a method to employ currently weakly annotated data into training. We evaluate our approach on the large-scale chest X-Ray datasets MIMIC-CXR, CheXpert, and ChestX-Ray14 for disease classification. We show that despite using unstructured medical report supervision, we perform on par with direct label supervision through a sophisticated inference setting.
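
To make the described architecture concrete, below is a minimal sketch of what a global-local dual-encoder contrastive objective of this kind could look like in PyTorch. The function names, tensor shapes, temperature, and loss weights are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
        logits = a @ b.t() / temperature                    # (B, B) similarities
        targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def global_local_loss(img_emb, report_emb, sent_emb, sent_to_img, lam=(0.5, 0.5)):
        """Global report-level alignment plus a local sentence-level term.

        img_emb:     (B, D) image embeddings, L2-normalized
        report_emb:  (B, D) full-report embeddings (global view)
        sent_emb:    (S, D) sentence embeddings (local view)
        sent_to_img: (S,)   index of the image each sentence came from
        """
        loss_global = info_nce(img_emb, report_emb)
        # Local term: each sentence is pulled toward its own image; one report
        # contributes several semantically distinct sentences.
        logits = sent_emb @ img_emb.t() / 0.07              # (S, B)
        loss_local = F.cross_entropy(logits, sent_to_img)
        return lam[0] * loss_global + lam[1] * loss_local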

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_66

SharedIt: https://rdcu.be/cVRzk

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    Existing frameworks for computer-aided diagnosis rely on a fixed set of predefined categories that are automatically extracted from medical reports. This paper proposes a framework that goes beyond this assumption to make diagnosis more context-aware.

    Methodologically, they employ a contrastive global-local dual-encoder designed to uncover latent concepts from unstructured medical reports, while still retaining the ability to perform free-form classification. They also investigate properties of such free-form recognition and propose a method to employ weakly annotated data to improve training.

    They evaluate on large-scale chest X-Ray datasets such as MIMIC-CXR, CheXpert, and ChestX-Ray14 to demonstrate the efficacy of their method in comparison to methods employing direct supervision.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is extremely well written. The explanation of the problem and the proposed solution is precise, well motivated, and sufficiently detailed.

    2. Evaluation is performed on large-scale datasets, with extensive investigation of various factors impacting model performance, such as prompt engineering and ablations on loss components and model heads. The experimental design is thorough, which is a big plus for potential translational applications. Extensive and relevant baseline comparisons are provided.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The model has four different hyperparameters \lambda_1 - \lambda_4 to weigh the contributions of the various loss terms. How do the authors arrive at the settings mentioned below Eq. 5? To which of these parameters is the performance most sensitive?

    2. Some of the differences reported in Table 1 are rather minor and are not accompanied by a measure of standard deviation. How consistent are these differences, and would they hold up under a statistical comparison?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This looks alright to me.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Minor:

    1. Typo in the line below Eq. 5: two settings for \lambda_4 are mentioned.

    2. The explanation above Eq. 1 for breaking the symmetry is a bit hard to parse. Perhaps a few lines of explanation could be provided on which portion of the loss differs from the traditional MIL-NCE construction (see the sketch below).
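
    For reference, a minimal sketch of the traditional MIL-NCE construction mentioned above (in the spirit of Miech et al., 2020), against which the paper's asymmetric variant could be contrasted. Names, shapes, and the temperature are illustrative assumptions, not the paper's implementation:

        import torch

        def mil_nce(img_emb, sent_emb, pos_mask, temperature=0.07):
            """Traditional MIL-NCE: each image owns a *bag* of candidate positive
            sentences; scores over the bag are summed before normalization.

            img_emb:  (B, D) L2-normalized image embeddings
            sent_emb: (S, D) L2-normalized sentence embeddings
            pos_mask: (B, S) bool, True where sentence s belongs to image b
            """
            sim = (img_emb @ sent_emb.t() / temperature).exp()  # (B, S)
            pos = (sim * pos_mask).sum(dim=1)                   # sum over the positive bag
            denom = sim.sum(dim=1)                              # positives + negatives
            return -(pos / denom).log().mean()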

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well structured, with careful and thorough experimental design with the claims being made supported by the results. There are a few minor points that need addressing (see weaknesses). However, overall the paper is in good shape.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a method for training a framework to detect different diseases in chest X-ray images. Training is performed with report supervision at the local (sentence) and global (full report) levels.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work presents a novel technique for integrating radiology reports into model training for the detection of different diseases in chest X-rays. The authors propose a method for adapting the well-known contrastive language-image pretraining of natural images to more complex text such as radiology reports. The analysis and results provided are clear and well defined.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In some cases, the improvement in the results is not very large. A statistical analysis would be helpful to better understand the significance of the obtained improvement.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good reproducibility. The architecture is well defined and results are demonstrated on open-source datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This is a well-written manuscript that presents relevant work on a widely studied topic. Reducing the need for data annotation is very helpful for many other medical imaging applications. A more in-depth statistical analysis of the results would be helpful to better understand the impact of the improvements achieved.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present an elaborate framework for multimodal training that involves both image and text, obtaining text information from radiology reports both locally and globally. This framework is novel and very valuable for the community, along with the introduction of self-supervised learning and a wider range of diseases that do not require explicit annotation.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper aims to make the neural networks used for medical images less reliant on label supervision. To this end, this work proposes a novel contrastive language-image pre-training method. Extensive experiments and analyses on four datasets, i.e., MIMIC-CXR, CheXpert, ChestX-ray14, and PadChest, demonstrate the effectiveness of the proposed approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed approach is well-motivated and novel for medical images.
    2. The experiments and analyses are very extensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The novelty of the framework is limited.

    1. In my opinion, this work mainly adapts the existing contrastive language-image pretraining model, i.e., CLIP [1], from computer vision to a new domain or a new task. However, introducing large pre-trained models to solve a new downstream task does not by itself bring new insights to the community. It is very important to explain why the proposed approach improves performance and what problems it can solve.
    2. Some previous works, e.g., [2], have attempted to adapt contrastive language-image pretraining to the medical image analysis field. The authors neither cite nor compare with it.

    [1] Radford et al.: Learning Transferable Visual Models From Natural Language Supervision. 2021.
    [2] Zhang et al.: Contrastive Learning of Medical Visual Representations from Paired Images and Text. 2020.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I believe that the obtained results can, in principle, be reproduced. Even though key resources (code) are unavailable at this point, the key details (e.g., proof sketches, experimental setup) are sufficiently well described for an expert to confidently reproduce the main results, if given access to the missing resources.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. The novelty of the framework is limited. In my opinion, this work mainly adapts the existing contrastive language-image pretraining model, i.e., CLIP [1], from computer vision to a new domain or a new task. However, introducing large pre-trained models to solve a new downstream task does not by itself bring new insights to the community. It is very important to explain why the proposed approach improves performance and what problems it can solve.

    2. I recommend the authors add a Related Work section to help readers better understand the differences between this work and previous works, e.g., [2]. Besides, I recommend further discussing the advantages and disadvantages of each previous work, instead of just listing them, which would help readers understand the strengths and weaknesses of this work.

    3. Many important hyperparameters are missing, such as the learning rate, batch size, and number of epochs, which hinders reproducibility. Besides, it is necessary to report model and training details for the proposed approach, for example the training time, number of model parameters, memory cost, and so on.

    4. The paper is written in an optimistic tone that leads the reader to assume the proposed approach is rather good. However, I am more interested in knowing whether the approach introduces errors, what types of errors it introduces, and why.

    5. I would like to see a statistical significance test, since the performance gap between the proposed approach and the previous state-of-the-art methods is small.

    [1] Radford et al.: Learning Transferable Visual Models From Natural Language Supervision. 2021.
    [2] Zhang et al.: Contrastive Learning of Medical Visual Representations from Paired Images and Text. 2020.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The novelty of the proposed framework is limited. In my opinion, introducing large pre-trained models to solve a downstream task does not by itself bring new insights to the community. It is very important to explain why the proposed approach can solve the claimed problems.

    2. It’s not clear why the proposed approach can improve the fluency, adequacy, and fidelity of novel object captions.

    3. Some important analyses should be performed to prove the contributions and claims of the paper.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers and I agree that moving away from a fixed set of findings in training a CAD classifier is an important and useful paradigm shift. The improvements reported are small, but I think the value lies in the methodology.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4




Author Feedback

We thank all reviewers for their valuable feedback and positive assessment of our work. We are pleased that our proposed approach of breaking away from label-based training for pathology recognition in chest radiographs, by utilizing report-image pairs during training and contrastive prompting schemes during inference, is seen as novel in the domain (R2, R4), well motivated (R1, R2), and well evaluated (R1, R2). Below, we address the concerns of the reviewers.

(R1, R2, R4): “The work would benefit from a statistical analysis.” We agree with the reviewers that evaluation schemes such as k-fold or leave-one-out cross-validation would allow us to gain more insight into the robustness of the results. However, due to the sheer size of the datasets and the resulting computational effort, we did not see this as feasible at the time and instead opted for a thorough evaluation on multiple datasets to gain insight into generalization capabilities.

(R1): “How are the \lambda hyperparameters set?” We did not perform an extensive grid search over these parameters due to the computational effort of such experiments, but set the values intuitively based on the inverse of the number of contrastive loss terms present in each loss formulation, similar to Radford et al.’s CLIP or Chen and He’s SimSiam.

(R4): “What problems can be solved with the proposed approach?” Compared to natural photos of, e.g., a cat, radiological data such as chest radiographs are multifaceted: one sentence might describe the occurrence of a nodule, while another describes the existence of a pneumothorax. Thus, while a single sentence might be sufficient for a natural image, a multitude of semantically non-overlapping sentences can be used to describe an X-ray. While other data-efficient CLIP variants such as DeCLIP or SLIP exist, the currently prevalent way of performing inference with these models does not account for this variety, but rather pits the potential classes against each other in a one-versus-all setting, which does not transfer to many medical imaging applications. Therefore, apart from adapting the training objectives to this domain, we see our major contribution in our contrastive prompt inference scheme, which allows these models to be queried for any finding independently. To the best of our knowledge, this feature is not present in any other work on contrastive language-image models. Furthermore, our approach offers a natural pathway to integrate weakly annotated medical data into the training process.
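
To make the inference scheme concrete, a minimal sketch of contrastive prompt inference as described in this rebuttal; the prompt templates, the encode_text interface, and the temperature are illustrative assumptions, not the paper's exact prompts:

    import torch

    def query_finding(image_emb, encode_text, finding, temperature=0.07):
        """Score one finding independently via a positive/negative prompt pair,
        instead of a one-versus-all softmax over a fixed class list."""
        prompts = [f"There is {finding}.", f"There is no {finding}."]
        text_emb = encode_text(prompts)                  # (2, D), L2-normalized
        logits = image_emb @ text_emb.t() / temperature  # (1, 2)
        return logits.softmax(dim=-1)[0, 0]              # P(finding present)

    # Because each finding is scored against its own prompt pair, novel classes
    # can be queried without retraining:
    # for finding in ["pneumothorax", "nodule", "pleural effusion"]:
    #     score = query_finding(image_emb, encode_text, finding)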

(R4): “The authors do not compare with/cite Zhang et al. 2020.” Thank you for pointing this out. We will address this in the final version. Regarding the comparison, Zhang et al. 2020 can be seen as similar to CLIP, since the loss terms utilized are technically identical.

(R4): “What are training hyperparameters?” We state hyperparameters such as the image size, learning rate, and optimizer in the implementation details. We will add a table containing all hyperparameter settings to the supplementary material.

(R4): “What are some potential pitfalls of the method?” While we believe that our work opens up more possible uses for language-image pretraining, we agree with the reviewer that there are potential pitfalls. As with other CLIP-based methods, the classification performance heavily depends on how similar the chosen inference prompts are to the way the respective classes naturally occur in text. This can be observed in Figure 2 of our work, where, for instance, the proposed detailed prompting scheme is worse than the basic scheme at identifying the class “fracture”. Similarly, the way “support devices” occur in reports does not seem to align with the prompting scheme. As such, we see the greatest potential for pitfalls in the choice of prompting scheme and think that methods that automate this process can bring the most notable improvement [1].

[1] Zhou, Kaiyang, et al. “Learning to prompt for vision-language models.” 2021


