
Authors

Maxime Kayser, Cornelius Emde, Oana-Maria Camburu, Guy Parsons, Bartlomiej Papiez, Thomas Lukasiewicz

Abstract

Most deep learning algorithms lack explanations for their predictions, which limits their deployment in clinical practice. Approaches to improve explainability, especially in medical imaging, have often been shown to convey limited information, be overly reassuring, or lack robustness. In this work, we introduce the task of generating natural language explanations (NLEs) to justify predictions made on medical images. NLEs are human-friendly and comprehensive and enable the training of intrinsically explainable models. To this goal, we introduce MIMIC-NLE, the first, large-scale, medical imaging dataset with NLEs. It contains over 38,000 NLEs, which explain the presence of various thoracic pathologies and chest X-ray findings. We propose a general approach to solve the task and evaluate several architectures on this dataset, including via clinician assessment.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_67

SharedIt: https://rdcu.be/cVRzl

Link to the code repository

https://github.com/maximek3/MIMIC-NLE

Link to the dataset(s)

https://github.com/maximek3/MIMIC-NLE


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes an approach for extracting natural language explanations for conclusions from radiology reports. The approach is used to generate and publish a new dataset, MIMIC-NLE. The paper establishes performance baselines on this dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes an interesting extension to the MIMIC CXR dataset and a general method for producing such extensions. Both numerical and clinician evaluations are performed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a strong focus on explanations for positive findings; however, radiographic exams are often used to rule out hypotheses, so natural language explanations for negative findings would be relevant as well. Some aspects of the work are simply asserted, without proof or reference. E.g., “we observe that […] a small selection of phrases … are very accurate identifiers”: how accurate, and how did the authors avoid confirmation bias here? E.g., the evidence graph was constructed using prior radiologist knowledge and by empirically validating the co-occurrences: how many radiologists were involved, were they authors or non-authors, and could the empirical validation be provided? The example discussed, of how “a model that generates generic NLEs that make reference to Lung Opacity will yield a good score”, may indicate a more serious problem than the authors suggest. The baseline results provided are not entirely convincing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility statement claims that code and scripts to reproduce the results will be released, but I see no placeholder for this in the text.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Spelling error, p. 4: “radiolographic”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Natural language explanations are relevant but underexplored. This paper contributes a new dataset and approach. Unfortunately, several aspects seem to be asserted without strong enough basis, and the limitations in scope of the resulting dataset are not sufficiently discussed.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces the task of generating natural language explanations (NLEs) to justify predictions made on medical images. As a first step, the authors created MIMIC-NLE, the first large-scale medical imaging dataset with radiological NLEs; it contains over 38,000 NLEs, which explain the presence of various thoracic pathologies and chest X-ray findings. In addition, the authors propose a general approach to solve the task and evaluate several architectures on this dataset, including via clinician assessment.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    – a new dataset is presented and will be useful for community – presented an approach to provide NLEs for multi-label classification – evaluation is validated by a clinician.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    – Contribution is limited: the presented dataset is collected from an already public dataset.
    – I could not find any explanatory component in the paper. Traditionally, explaining means showing that the method/results are not a black box but explainable (the title of the paper is misleading).
    – Previous methods are used to set the baseline results, yet the authors claim they propose a new method.
    – All baselines are from previous studies; DPT is inspired by a previous SOTA method and leverages a DenseNet and GPT-2.
    – The paper lacks a discussion of the experimental results and a motivation analysis.
    – The results lack visualization and statistical analysis.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is OK, and the authors will release the source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    – Contribution is limited: the presented dataset is collected from an already public dataset.
    – I could not find any explanatory component in the paper. Traditionally, explaining means showing that the method/results are not a black box but explainable (the title of the paper is misleading).
    – Previous methods are used to set the baseline results, yet the authors claim they propose a new method.
    – All baselines are from previous studies; DPT is inspired by a previous SOTA method and leverages a DenseNet and GPT-2.
    – The paper lacks a discussion of the experimental results and a motivation analysis.
    – The results lack visualization and statistical analysis.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the new dataset is useful, the paper lacks discussion: the motivation and a detailed discussion are missing. The title of the paper is misleading; explaining refers to showing that the method is explainable with visualization, and the paper lacks visualization.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces the first dataset of natural language explanations (NLEs) to justify predictions made on medical images. The authors validate a novel approach to generate NLEs for multi-label classification: it automatically distills NLEs from the radiology reports of the MIMIC-CXR dataset to create a new dataset called MIMIC-NLE. The paper also proposes self-explaining models that learn to detect lung conditions and explain their reasoning in natural language.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is interesting. It proposes to fully capture how evidence in a scan relates to a diagnosis. Specifically, the predictive model aims to detect lung conditions and explain its reasoning in natural language.

    The new dataset MIMIC-NLE (38,000 high-quality NLEs distilled from over 200,000 radiology reports) can serve as an important resource that provides chest X-rays with diagnosis and evidence labels, as well as NLEs for the diagnoses.

    The paper proposes the first general approach to provide NLEs for multi-label classification, which encourages new research on NLEs for chest X-ray interpretation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is difficult to evaluate NLEs with automatic NLG metrics.

    2. The GT NLEs obtain an absolute rating score of only 3.2/5, which is quite low.

    3. It is not clear to me whether the absolute rating score of 3.2/5 can be explained by generic inter-annotator disagreement between the clinician and the authors of the reports, or whether it comes from another source. The authors could verify this assumption via an additional evaluation.

    4. The clinical evaluation should be performed by a consensus of clinicians.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors claim that their dataset and code will be made publicly available upon acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The clinical evaluation should be performed by a consensus of clinicians.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    New large-scale, annotated dataset and new evaluation framework for chest X-ray analysis with natural language explanations.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths:

    • A new large-scale dataset, named MIMIC-NLE (38,000 natural language explanations), is introduced by generating natural language explanations on MIMIC-CXR. It can be used in future XAI studies that investigate methods for generating the reasoning behind predictions in natural language.
    • The paper presents an approach to provide NLEs (natural language explanations) for multi-label classification problems.

    Weaknesses:

    • The natural language dataset focuses only on explanations for positive findings; natural language explanations for negative findings would also be important.
    • Some parts of the paper need more clarification and references.
    • The scope and limitations of the dataset have not been sufficiently discussed.
    • Evaluation of NLEs is an important issue, but the evaluation presented is a bit unclear and not satisfactory.
    • The current title might mislead the readers. It would be great to refine it.

    Overall: In this paper, the authors propose an interesting dataset for natural language explanation. Explaining predictions at the level of medical experts is an important next step for reliable deep learning in the medical domain. Although the proposed dataset is grounded in existing public data, the annotation of the explanations is new. All reviewers agree that the proposed dataset has merit, and publishing this new data will enable interesting follow-up studies in this area. Although there are some weaknesses in the current version of the paper, it definitely has merit for the MICCAI community. I would suggest the acceptance of this paper. It would be great to address the reviewers’ concerns in the final version of the paper and to publish the dataset for future studies.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank the reviewers for their detailed comments and address their questions below.

NLEs for negative findings (R1): In this work, we focus on explanations for both uncertain and positive findings, i.e., explanations for the (likely) presence of different lung conditions. Our main motivation for this was that negative findings, e.g., stating that clearly no pneumonia is suggested from a chest X-ray, will generally not require case-specific explanations. For example, a template-like sentence such as “The lungs are clear” would be a valid NLE for most negative cases of pneumonia. For uncertain cases, our NLEs can either act as case-specific explanations for why a condition is present or why it may look like it is, but it’s actually a negative finding (for example: “There is subtle opacity adjacent to left heart border which could represent a prominent fat pad versus a very early pneumonia.”).

Dataset generation assertions (R1): We identified 14 phrases, given in Table 6 in the Appendix, that were used to identify sentences of explanatory nature. These phrases were selected by exploring the data, referring to the literature, and having separate discussions with three clinicians/radiologists. Their quality was confirmed when one of the paper’s authors evaluated a sample of 100 extracted explanations, of which 92% were accurate NLEs. The evidence graph was mainly designed in discussion with a non-author radiologist. Its purpose is to be a foundation from which we could establish the rules given in Table 1. Thus, each of these evidence links has been validated manually (i.e., over 90% of the sentences containing them had to be valid NLEs) before being solidified as a rule.
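The phrase-based extraction described above could be sketched roughly as below. This is a minimal illustration only: the phrase list and helper names are hypothetical, not the actual 14 phrases from Table 6 of the paper's appendix.

```python
import re

# Illustrative subset of explanatory phrases; the paper's actual list
# contains 14 phrases (Table 6 in its appendix), not reproduced here.
EXPLANATORY_PHRASES = [
    "suggestive of",
    "consistent with",
    "may represent",
    "compatible with",
    "likely reflects",
]

def split_sentences(report: str) -> list[str]:
    """Naive sentence splitter for radiology report text."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]

def extract_candidate_nles(report: str) -> list[str]:
    """Return report sentences containing any explanatory phrase."""
    return [
        sentence
        for sentence in split_sentences(report)
        if any(p in sentence.lower() for p in EXPLANATORY_PHRASES)
    ]

report = (
    "The lungs are clear. "
    "There is a subtle opacity adjacent to the left heart border, "
    "which may represent a prominent fat pad versus early pneumonia."
)
print(extract_candidate_nles(report))
```

In the actual pipeline, candidates flagged this way would additionally be checked against the evidence-graph rules (Table 1) and manually validated on samples, as described above.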

CLEV score limitations (R1): In the paper, we mentioned that “a model that generates generic NLEs that make reference to Lung Opacity will yield a good score”. This limitation specifically applies to the CLEV score, as it does not take into account much of the diversity that is inherent in our NLEs, such as the location of findings, their size and appearance, as well as how they relate to the condition that is predicted. This richness and diversity of the NLEs prevent “generic NLEs that make reference to Lung Opacity” from being satisfactory in any context outside of the CLEV score.

Misleading title (R2): Our paper proposes an alternative form of explanation to visualization, namely, natural language explanations (NLEs). In both NLP and computer vision, there are a plethora of works on NLEs (including for applications such as self-driving cars [1]), which are generally accepted as a type of explanation, since they justify an answer. Moreover, previous work has suggested that visualization methods such as saliency mapping are insufficient for medical imaging [2, 3]. This is one of the first applications of NLEs to medical imaging and we do not believe that our title is misleading.

[1] Kim, J. et al. 2018. https://doi.org/10.1007/978-3-030-01216-8_35. [2] M. Ghassemi et al. 2021, https://doi.org/10.1016/S2589-7500(21)00208-9.

Low GT NLE score (R3): The clinician rating score of the GT NLEs is 3.2/5, which is arguably lower than expected. After the evaluation, we discussed these results with the clinician and found that, in a majority of cases where the GT was scored poorly, the clinician did not agree with the explanation provided by the author of the original report.

Scope and limitation of the dataset (R2): An inherent limitation of this dataset is the inter-annotator disagreement that exists between radiologists, i.e., the upper bound of the quality of this dataset is given by the radiologists that wrote the original reports. Our automatic extraction process introduces around 8% of sentences that are not valid NLEs, in that they don’t provide an explanation for the label that is predicted. Nonetheless, many of these sentences still contain valuable information, such as descriptions of the condition at hand.


