Authors

Lin Wang, Munan Ning, Donghuan Lu, Dong Wei, Yefeng Zheng, Jie Chen

Abstract

To avoid the tedious and laborious radiology report writing, the automatic generation of radiology reports has drawn great attention recently. Previous studies attempted to directly transfer the image captioning method to radiology report generation given the apparent similarity between these two tasks. Although these methods can generate fluent descriptions, their accuracy for abnormal structure identification is limited due to the neglecting of the highly structured property and extreme data imbalance of the radiology report generation task. Therefore, we propose a novel task-aware framework to address the above two issues, composed of a task distillation module turning the image-level report to structure-level description, a task-aware report generation module for the generation of structure-specific descriptions, along with a classification token to identify and emphasize the abnormality of each structure, and an auto-balance mask loss to alleviate the serious data imbalance between normal/abnormal descriptions as well as the imbalance among different structures. Comprehensive experiments conducted on two public datasets demonstrate that the proposed method outperforms the state-of-the-art methods by a large margin (3.5% BLEU-1 improvement on MIMIC-CXR dataset) and can effectively improve the accuracy regarding the abnormal structures. The code is available at https://github.com/Reremee/ITA.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_54

SharedIt: https://rdcu.be/cVVqa

Link to the code repository

https://github.com/Reremee/ITA

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes a structure-level description of X-ray images, which is an interesting idea. In addition, the paper performs abnormality detection for each structure and introduces an auto-balance loss to handle the skewed availability of data between normal and abnormal patients.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

It is an interesting idea to solve the problem structure-wise rather than as a whole, inspired by how a real doctor would do the job. Using one head for each structure in a multi-head attention network is also a meaningful proposal. The auto-balance mask loss is a useful idea.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Fig. 1 is not clear. The output after each step is not clear. The notation alone is not clear. The task distillation module is not explained properly.

A few other questions need to be addressed:

What information does the knowledge graph provide? What type of data is generated after building the knowledge graph? How is it used by the network, in terms of the dimensions of the data?

Are the image embeddings and classification tokens processed in parallel in the encoder?

What is the difference between the classification embedding and the classification token? It’s not clear.

There is no mention of a ground-truth description for each structure.

There is no mention of ground-truth normal/abnormal labels for each structure.

How much performance variation was observed with the auto-balance loss function? A comparison with a weighted loss function could be added.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Reproducible with some efforts.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Fig. 1 is not clear. The output after each step is not clear. The notation alone is not clear. The task distillation module is not explained properly.

A few other questions need to be addressed:

What information does the knowledge graph provide? What type of data is generated after building the knowledge graph? How is it used by the network, in terms of the dimensions of the data?

Are the image embeddings and classification tokens processed in parallel in the encoder?

What is the difference between the classification embedding and the classification token? It’s not clear.

There is no mention of a ground-truth description for each structure.

There is no mention of ground-truth normal/abnormal labels for each structure.

How much performance variation was observed with the auto-balance loss function? A comparison with a weighted loss function could be added.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written, but the authors need to answer a few questions and clarify them with reasonable justifications.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper studies the medical report generation task. Unlike existing approaches, it focuses on the special structure of medical reports by introducing a task distillation (TD) module. The TD module leverages prior knowledge (keywords) to group the sentences of a report into a set of anatomical structures. The text block of each anatomical structure is later used individually to train a transformer decoder for text generation. The encoder is built on top of a CNN feature extractor and a class token embedding pool. Features extracted from the image and the abnormality type are fed to a transformer encoder. To better combat data imbalance during training, the authors also propose a special sampling method named Auto-Balance Mask. Extensive experiments on two benchmark datasets confirm the superiority of the solution.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1, the structure-aware idea makes great sense. It can greatly improve the coverage of the generated report. In the real world, many anatomical components are omitted from reports simply because no issue is found. This leaves a large amount of context missing when training a model. In this paper, the structure-aware idea is realized by combining the task distillation module and specialized decoders. Retaining all anatomical components is the main contribution of this paper.

2, as discussed above, context is omitted intentionally in the real world. To counter this, the authors propose to randomly attach normal descriptions (from other reports); this is a substantial augmentation of the training data.

3, moreover, the imbalance among the abnormal parts of reports also causes trouble. As a remedy, the paper proposes the Auto-Balance Mask Loss for better training.

4, comprehensive experiments confirm both the superiority of the whole solution and the necessity of each technical component.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Despite the well-formulated ideas and system design, I have the following questions: 1, how are the prior keywords determined? It seems that these keywords are predefined before the TD module. Any discussion in this regard, or examples of keywords, would be helpful.

2, the interaction between TD and the transformer encoders is not discussed very well. Even though I have a rough idea of the interaction, as discussed in the strengths section, I am still not sure whether my understanding is correct.

3, how is the TD module evaluated in the first place? From my understanding and the ablation study (Table 2), the TD module helps provide training data (?) for the decoders in TRG. If TD fails, presumably the whole model degrades.

4, during inference, the abnormality types are also needed for report generation. How would performance change (presumably degrade) without knowing the abnormality?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Positive if the code is released.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Please address the questions listed above. Other than that I have the following suggestions (non-decision related): 1, do multiple rounds of proofreading especially for breaking down long sentences into small pieces. e.g. in the abstract, the sentence that starts with “Therefore, we propose” spans 9 lines, making it really hard to follow.

2, Fig. 1 seems to have typos in the red blocks. Since this is the main illustration of the solution, misinformation there is not acceptable.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The main idea of this paper is to break a medical report down into anatomical parts and generate the text for each part instead of generating the report as a whole. This idea adds little complexity while improving performance substantially.
Number of papers in your stack

1
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

This paper describes a framework for radiology report generation. A novel task-aware framework is proposed that is composed of a task distillation module turning the image level report to structure-level description. There is also a classification mechanism to identify and emphasize the abnormality of each structure.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

A novel task-aware framework is proposed that extracts various information from the dataset and groups the descriptions into several anatomical-structure-specific aspects instead of generating the overall report at once.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

A discussion of misclassifications and of cases where reports are not generated correctly is missing.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

yes
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

This paper is interesting.

Comments below:

Discuss possible misclassifications and mislabelling (wrong reports generated), possible ways to handle them, and how severe those errors are.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

novel approach
Number of papers in your stack

4
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

All three reviewers liked the idea of the paper, which is turning the image level report to structure-level description so that context about anatomical structures (including normal ones) can be generated. Experimental validation is convincing as well, demonstrated by superior results on two datasets. The reviewers also pointed out some weaknesses such as missing details, lack of discussion of failure cases, etc.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

3

Author Feedback

We thank all the reviewers for their constructive comments and for recognizing our novelty and contribution. Below, we summarize our replies to the questions raised in the reviewers’ weaknesses sections; we believe these issues can be addressed in the final version:

R1.1 The knowledge graph contains important structures in chest radiographs, along with related normal and abnormal keywords. The former is used to guide the decomposition of the original report, while the latter is used to determine whether the structure is normal or not.

R1.2 Yes. The image and classification tokens are concatenated and processed in parallel in the encoder.
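
The concatenation described in R1.2 can be pictured with a minimal Python sketch. It assumes a single self-attention layer with identity projections in place of the actual encoder, and the function names `self_attention` and `encode` are hypothetical; it only illustrates that image embeddings and classification tokens share one input sequence, so each classification token can attend to every image patch.

```python
import math


def self_attention(seq):
    """Single-head dot-product self-attention with identity projections (a simplification)."""
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in seq]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output vector is a weighted average of all vectors in the sequence.
        out.append([sum(w * value[d] for w, value in zip(weights, seq)) for d in range(dim)])
    return out


def encode(image_embeddings, class_tokens):
    """Concatenate both inputs along the sequence axis, encode jointly, then split again."""
    joint = image_embeddings + class_tokens  # list concatenation = one joint sequence
    encoded = self_attention(joint)
    n_patches = len(image_embeddings)
    return encoded[:n_patches], encoded[n_patches:]


# Toy inputs: three image patch embeddings and two classification tokens (made-up values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[0.5, 0.5], [0.2, 0.1]]
patch_out, token_out = encode(patches, tokens)
```

Because the two inputs are encoded as one sequence, changing any image patch also changes the encoded classification tokens, which is what allows a token to reflect the abnormality of its structure.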

R1.3 Similar to ViT [1], the embedding is the sequence of image patches used as the input to the Transformer, while the token is the predicted output.

R1.4 As shown in Fig. 2 and described in Section 2.1 (‘By grouping sentences according to the existing of these keywords, the ground truth report R of radiograph X can be disassembled into task-wise sections’), the ground-truth descriptions are disassembled from the original report. In addition, ‘We supplement the missing descriptions with normal descriptions randomly selected from the entire training set.’

R1.5 The classification labels for each structure are extracted from the structure-wise descriptions in Section 2.1. Based on the keywords of the knowledge graph, if the description of a specific structure contains some positive keywords, it is set to abnormal, and vice versa.
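
The keyword-based decomposition and labelling described in R1.4 and R1.5 can be illustrated with a minimal Python sketch. The keyword table, the pool of normal sentences, and the function name `distill` are hypothetical placeholders rather than the paper's knowledge graph or released code. Plain keyword matching also ignores negation (for example "no fracture"), which a real implementation would need to handle.

```python
import random
import re

# Hypothetical keyword table standing in for the knowledge graph (assumption).
KNOWLEDGE_GRAPH = {
    "heart": {"structure": {"heart", "cardiac"}, "abnormal": {"enlarged", "cardiomegaly"}},
    "lung": {"structure": {"lung", "lungs"}, "abnormal": {"opacity", "consolidation"}},
    "bone": {"structure": {"bone", "rib", "osseous"}, "abnormal": {"fracture"}},
}

# Hypothetical pool of normal descriptions gathered from training reports (assumption).
NORMAL_POOL = {
    "heart": ["the heart size is normal."],
    "lung": ["the lungs are clear."],
    "bone": ["no acute osseous abnormality.", "the osseous structures are intact."],
}


def words_of(text):
    """Lower-cased alphabetic words of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def distill(report, normal_pool, rng):
    """Split a report into structure-wise sections and derive normal/abnormal labels."""
    sentences = [s.strip().lower() for s in report.split(".") if s.strip()]
    sections, labels = {}, {}
    for name, keywords in KNOWLEDGE_GRAPH.items():
        # Group the sentences that mention this structure.
        matched = [s for s in sentences if keywords["structure"] & words_of(s)]
        if matched:
            sections[name] = ". ".join(matched) + "."
            # Label 1 (abnormal) if any abnormal keyword occurs, else 0 (normal).
            labels[name] = int(bool(keywords["abnormal"] & words_of(sections[name])))
        else:
            # Structure not mentioned: supplement with a randomly chosen normal description.
            sections[name] = rng.choice(normal_pool[name])
            labels[name] = 0
    return sections, labels


sections, labels = distill(
    "The heart is enlarged. The lungs are clear.", NORMAL_POOL, random.Random(0)
)
```

In this toy report the heart section is labelled abnormal, the lung section normal, and the unmentioned bone section is filled in from the normal pool.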

R1.6 We will add these results in the final version.

R2.1 We take two steps to define the keywords. First, candidate keywords are selected based on prior knowledge from clinical studies [2]. Then, we propose a knowledge graph based on the candidates, select the structure keywords with high priority, and judge whether the structure is normal or abnormal.

R2.2 The TD module disassembles the original report into structure-specific descriptions, so that each head of the Transformer can concentrate on its own structure.

R2.3 Yes, the TD module plays an important role in our framework, and the performance of the subsequent Transformer would degrade significantly with an imprecise TD module.

R2.4 As shown in the ‘TRG’ and ‘TRG+CLT’ rows of Table 2, knowing the abnormality brings an improvement of 1.5% in BLEU-1.

R2.5 Thanks for pointing out the writing issues; we will address them in the final version. Sorry for the confusion about Fig. 1, but as stated in the legend, ‘head’ is not the name of an organ; it represents a task-specific decoder.

R3 Based on the experiments, the diagnostic performance for structures with more severe sample imbalance, such as ‘bone’, is worse than for others. Furthermore, our method tends to generate a report with redundant content for images without any abnormality, because physicians tend to describe only the major structures when reporting on normal images. To address this issue, we plan to add a head that generates a complete report and to choose between it and the structure-wise report based on the normality of the image.

[1] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

[2] Gay, S., Olazagasti, J., Higginbotham, J., Gupta, A., Wurm, A., Nguyen, J.: Introduction to chest radiology. https://www.meded.virginia.edu/courses/rad/cxr/index.html (2013)


An Inclusive Task-Aware Framework for Radiology Report Generation