Authors
Ming Kong, Zhengxing Huang, Kun Kuang, Qiang Zhu, Fei Wu
Abstract
Medical report generation, which aims at automatically generating coherent reports with multiple sentences for the given medical images, has received growing research interest due to its tremendous potential in facilitating clinical workflow and improving health services. Due to the highly patterned nature of medical reports, each sentence can be viewed as the description of an image observation with a specific purpose. To this end, this study proposes a novel Transformer-based Semantic Query (TranSQ) model that treats the medical report generation as a direct set prediction problem. Specifically, our model generates a set of semantic features to match plausible clinical concerns and compose the report with sentence retrieval and selection. Experimental results on two prevailing radiology report datasets, i.e., IU X-Ray and MIMIC-CXR, demonstrate that our model outperforms state-of-the-art models on the generation task in terms of both language generation effectiveness and clinical efficacy, which highlights the utility of our approach in generating medical reports with topics of clinical concern as well as sentence-level visual-semantic attention mappings. The source code is available at https://github.com/zjukongming/TranSQ.
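As a rough illustration of the set-prediction formulation the abstract describes — K learned semantic queries attending to visual features through a transformer decoder, each yielding a candidate-sentence embedding plus a selection score — here is a minimal sketch. All module names, dimensions, and hyperparameters are assumptions for illustration, not the published implementation (see the linked repository for that):

```python
# Hypothetical sketch of a TranSQ-style semantic query head (not the authors' code).
import torch
import torch.nn as nn

class SemanticQueryHead(nn.Module):
    def __init__(self, d_model=256, num_queries=25, nhead=8, num_layers=3):
        super().__init__()
        # K learned semantic query embeddings, one per latent clinical topic
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.select = nn.Linear(d_model, 1)  # per-query sentence-selection score

    def forward(self, visual_feats):
        # visual_feats: (B, num_patches, d_model) from a ViT-style encoder
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, K, d_model)
        sent_emb = self.decoder(q, visual_feats)          # (B, K, d_model)
        keep_logit = self.select(sent_emb).squeeze(-1)    # (B, K)
        return sent_emb, keep_logit
```

In this sketch, each `sent_emb` row would be matched against candidate sentences by similarity, and `keep_logit` would decide which candidates enter the final report.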
Link to paper
DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_58
SharedIt: https://rdcu.be/cVVqf
Link to the code repository
https://github.com/zjukongming/TranSQ
Link to the dataset(s)
https://physionet.org/content/mimic-cxr/2.0.0/
Reviews
Review #1
- Please describe the contribution of the paper
- The authors propose a Transformer-based Semantic Query (TranSQ) model to generate medical reports. The approach treats report generation as a sentence set prediction and selection problem, learning visual embeddings and semantic queries for the sentence candidate set. It is apparently the first work to consider medical report generation as a candidate set prediction and selection problem.
- The authors conducted experiments showing that TranSQ achieves good performance against existing methods on both NLP metrics and clinical efficacy metrics.
- They also provide a sentence-level interpretation of the report to illustrate the approach’s explainability.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Motivation for the paper is clearly explained.
- The methods used in the paper are interesting and distinct from current state-of-the-art work in the medical report generation field.
- First approach to treat report generation as a sentence set prediction and selection problem.
- The methods section is clearly explained, with motivations for each component of the work (visual feature extractor, semantic encoder, and report generator).
- A comparison against other methods is shown, and the results surpass previous state-of-the-art methods.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Failure modes of the approach have not been discussed.
- Current NLP metrics (e.g., ROUGE-L and BLEU-1) fail when findings are uncertain or absent. Perhaps CIDEr is a better metric for detecting the presence of abnormalities (Pino 2022 AIIM-D, SSRN).
- There are grammatical mistakes in the paper (e.g., Sec. 2.3, paragraph 1: don’t start a sentence with a conjunction like ‘and’). But overall, the paper is clear and easy to follow.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
I think the paper can be reproduced sufficiently. Hopefully, the authors will provide their code and make it public as it will be of use to the medical imaging community.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- How long does the model take to train on the dataset?
- Were there instances when the model did not perform adequately? What were the main failure modes, and what is your best guess as to why they occurred?
- Bipartite set matching with a vision transformer runs into issues when detecting small objects (see DETR, Carion et al. 2020). Does your model have similar issues? What is the smallest finding that it did not do well on?
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
7
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The paper has a reasonable motivating factor and the proposed approach seems to work very well on the MIMIC-CXR dataset as evidenced by the results compared to the current state of the art.
- The authors have conducted an experiment where they provide visual attention maps for each sentence, which is useful as it enables the model to indicate the image regions that lead to its confidence in providing targeted sentence candidates.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
1
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #2
- Please describe the contribution of the paper
Proposes a system that generates reports from images by proceeding sentence by sentence. A set of visual semantic queries is created; each query is then used to probe one specific aspect of the image, and the report is constructed from the collection of results.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper advances the SOTA on the report generation problem on two widely used datasets.
The comparison to previous work is thorough and clear.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The use of this part-based approach leads immediately to questions about the overall ordering of the sentences in the result. The method for ordering is defined, but does it affect the score?
Some of the example sentences in Fig. 2 seem to refer to priors, e.g., “stable”, “have increased”, but no priors are fed to the system, so the system has picked up this language from the training data. This would be good to discuss.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The reproducibility checklist claims the software is released and documented, but there is no reference to it in the paper (an anonymized footnote was expected). The reproducibility statement refers to the supplementary material, but it does not seem to have been submitted.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
It could be interesting to explore stability under small jittering of the image, which would cause the ViT to tokenize it differently.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
7
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
A new SOTA on the report generation problem is reached, with thorough comparison to prior work.
- Number of papers in your stack
4
- What is the ranking of this paper in your review stack?
1
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #3
- Please describe the contribution of the paper
- This work proposes to treat the medical report generation task as a sentence candidate set prediction and selection problem, which is novel and interesting.
- To solve this task, this work proposes a novel Transformer-based Semantic Query (TranSQ) model, which incorporates the well-known and powerful Vision Transformer (ViT) to significantly boost the performance.
- The visualization shows that the proposed approach can achieve better interpretability than previous works.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well-written and easy to follow.
- The proposed approach is interesting and novel.
- The experiments on two benchmark datasets show that the proposed approach can achieve state-of-the-art performances.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
Please see Q8 for details.
- Some very important analyses are missing, e.g., the ablation study.
- Some important implementation details are missing.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
I believe that the obtained results can, in principle, be reproduced. Even though key resources (code) are unavailable at this point, the key details (e.g., proof sketches, experimental setup) are sufficiently well described for an expert to confidently reproduce the main results, if given access to the missing resources.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
Strengths:
- The paper is well-written and easy to follow.
- The proposed approach is interesting and novel.
- The experiments on two benchmark datasets show that the proposed approach can achieve state-of-the-art performances.
Weaknesses:
- Some important implementation details are missing.
- Although the authors report very good performance, it is not clear to me which part of their method is responsible for it. In particular, the proposed model incorporates the powerful Vision Transformer (ViT), which is pre-trained on large-scale datasets, and there is no experiment showing how much of the improvement comes from the existing ViT and how much comes from the proposed approach.
- So I wonder: if previous works adopted ViT as the image feature extractor, could they achieve better performance than this approach?
- The novelty of the idea is limited.
- Although the authors claim that “they make the first attempt to address the medical report generation in a candidate set prediction and selection manner”, in my opinion, this idea is very similar to the existing retrieval module in medical report generation [1]. What are the main differences between this work and previous retrieval modules?
- Some implementation details are missing.
- How to initialize the semantic queries, and how to ensure that these K queries have different latent topic definitions? Did you visualize them to prove it?
- How to conduct the retrieval process in the proposed approach?
- How to obtain/construct the database used for retrieval?
- I recommend the authors add a related-work section to better discuss the differences between this work and previous works; this would also help readers better understand the contributions of this work.
- Although some of the newest methods, e.g., [2][3], perform worse than this paper, I still recommend the authors cite them in the tables.
typos:
- ‘To solve the problem, We consider’ -> ‘To solve the problem, we consider’;
[1] Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation. In NeurIPS, 2018.
[2] Competence-based Multimodal Curriculum Learning for Medical Report Generation. In ACL, 2021.
[3] Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. In ACL, 2021.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The proposed approach is novel and interesting, but some important experiments and implementation details are missing. In particular, the paper does not include an ablation study analyzing the contribution of the existing, powerful Vision Transformer (ViT), which may itself bring significant performance improvements. So it is unclear how much of the improvement comes from the existing ViT and how much comes from the proposed approach.
- Number of papers in your stack
7
- What is the ranking of this paper in your review stack?
1
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
All three reviewers commented on the novelty of the proposed approach and its state-of-the-art performance on two benchmark datasets. The reviewers also pointed out several weaknesses, such as the lack of a discussion of failure modes, the lack of an ablation study, and missing details on how sentences are ordered.
- What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
1
Author Feedback
Dear Reviewers/ACs/PCs of MICCAI: First, we want to express our sincere gratitude to all reviewers for their valuable affirmations, comments, and suggestions. Before we provide a point-by-point reply to the reviewers’ comments, we apologize for failing to upload the supplementary material, which includes an analysis of the semantic query set size and the queries’ latent topic meanings. If given the chance, we hope to re-upload it during the camera-ready period.
Reviewer#1:
Q1-1: Failure modes of the approach need discussion.
A1-1: Rare/masked words (e.g., XXXX) or abnormalities that are uncommon in the training set (e.g., t-spine osteophytes) may cause failures. We will show some failure cases in the open-source code and attach the link in the final version.
Q1-2: ROUGE-L was included instead of CIDEr.
A1-2: For the convenience of performance comparison, we follow the evaluation metrics commonly applied by existing works, i.e., BLEU, METEOR, and ROUGE-L.
Q1-3: Grammatical mistakes.
A1-3: Thanks for your comments. We will fix these issues in the final version.
Reviewer#2:
Q2-1: Does sentence order affect the score?
A2-1: Owing to the precise prediction of individual sentences, TranSQ obtains state-of-the-art performance even with random ordering. Specifically, TranSQ achieved (BL-1/BL-2/BL-3/BL-4/MTR/RG-L): 0.423/0.259/0.168/0.112/0.167/0.263 for MIMIC-CXR and 0.484/0.333/0.238/0.176/0.207/0.365 for IU X-Ray.
Reviewer#3:
Q3-1: Lack of ablation study and implementation details.
A3-1: Regarding the effect of ViT: it is better suited than CNN-based models for extracting the local visual features needed by the semantic queries, and some baselines (e.g., R2Gen) are also based on it. Still, it would be worth investigating the influence of applying ViT to previous methods. Regarding the semantic queries, we randomly initialize the semantic query embeddings. We discussed and visualized the meanings of the semantic queries in the supplementary material (which we failed to upload), showing that most semantic queries correspond to one or two medical terms. We will make an additional submission if given the chance, or publish it in another way.
Q3-2: Differences from previous retrieval-based methods.
A3-2: Previous retrieval-based methods generate an (ordered) sequence of topic states from the image and then generate a series of sentences. Some of them generate the topic states iteratively, based on the hypothesis that the inter-sentence relationship follows content logic, and may fail on long sequences. Besides, some require pre-defined medical terms/templates. We believe that the inter-sentence order of medical reports is determined more by writing conventions than by content logic, so it is more important to guarantee the accurate intention of each single sentence. Based on this, we generate a set of (unordered) sentence candidates from the semantic query set, i.e., direct set prediction, ensuring that each generated sentence is tied to a latent topic. The embeddings of the semantic queries are trained with a bipartite matching process, without pre-definitions.
Q3-3: How to construct the retrieval database and conduct the retrieval process?
A3-3: As mentioned in the Implementation Details of Sec. 3.1, the retrieval database is constructed from the sentences of the training set. The retrieval process selects the sentence whose embedding has the maximum cosine similarity with the generated sentence embedding.
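For concreteness, here is a minimal sketch of the two mechanisms described in A3-2 and A3-3: bipartite (Hungarian) matching between the K predicted semantic-query embeddings and the ground-truth sentence embeddings during training, and maximum-cosine-similarity retrieval from the training-sentence database at inference. All function and variable names are illustrative assumptions, not the authors’ implementation:

```python
# Illustrative sketch only; the names used here are hypothetical and
# do not come from the TranSQ codebase.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_sentences(pred_emb, gt_emb):
    """Bipartite matching of K predicted embeddings (K, d) to M ground-truth
    sentence embeddings (M, d) by minimizing negative cosine similarity."""
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    gt = gt_emb / np.linalg.norm(gt_emb, axis=1, keepdims=True)
    cost = -pred @ gt.T                        # (K, M) matching cost
    q_idx, s_idx = linear_sum_assignment(cost)
    return q_idx, s_idx                        # matched (query, sentence) index pairs

def retrieve_sentences(pred_emb, db_emb, db_sentences):
    """For each predicted embedding, return the training-set sentence whose
    embedding has maximum cosine similarity, as described in A3-3."""
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    db = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = pred @ db.T                         # (K, N) cosine similarities
    return [db_sentences[i] for i in sims.argmax(axis=1)]
```

Under this reading, the matched pairs would drive an embedding loss on the selected queries during training, while at inference only the queries that pass the selection step would contribute retrieved sentences to the report.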
Q3-4: Need to cite additional references.
A3-4: Thanks for the advice. We will add the mentioned references and their results in the final version.
Dear Reviewers of MICCAI: Thanks for your valuable affirmations, comments, and suggestions. We hereby respond to some of your general comments and suggestions:
C1: How long does the model take to train on the dataset?
R1: Under our settings, for IU X-Ray, ~2 min per epoch (~1 min for train and ~1 min for val) and ~1.5 h for the entire 50 epochs; for MIMIC-CXR, ~1 h per epoch (~40 min for train and ~18 min for val) and ~20 h for the entire 20 epochs.
C2: Does the model have issues with small findings?
R2: From our observations, TranSQ performs well on small findings such as postoperative traces or pacemakers. There may be two reasons: 1. small findings in medical images have similar visual features; 2. bounding-box regression on small objects may be what hurts DETR’s performance, and it is irrelevant to our approach.
C3: It could be interesting to explore stability under small jittering of the image.
R3: Thanks for your advice. We believe this is an interesting direction for studying model stability and interpretability, and we will work on it in future work.