
Authors

Zhanyu Wang, Mingkang Tang, Lei Wang, Xiu Li, Luping Zhou

Abstract

Automated radiographic report generation is a challenging cross-domain task that aims to automatically generate accurate and semantically coherent reports to describe medical images. Despite recent progress in this field, many challenges remain, at least in the following aspects. First, radiographic images are very similar to each other, so it is difficult to capture fine-grained visual differences with a CNN as the visual feature extractor, as many existing methods do. Further, semantic information has been widely applied to boost the performance of generation tasks (e.g., image captioning), but existing methods often fail to provide effective medical semantic features. To address these problems, in this paper we propose a memory-augmented sparse attention block that utilizes bilinear pooling to capture higher-order interactions between the input fine-grained image features while producing sparse attention. Moreover, we introduce a novel Medical Concepts Generation Network (MCGN) to predict fine-grained semantic concepts and incorporate them into the report generation process as guidance. Our proposed method shows promising performance on MIMIC-CXR, the recently released largest benchmark. It outperforms multiple state-of-the-art methods in image captioning and medical report generation. Our code is available at https://github.com/zwan0839/MSAT.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_63

SharedIt: https://rdcu.be/cVRuO

Link to the code repository

https://github.com/zwan0839/MSAT

Link to the dataset(s)

https://drive.google.com/file/d/1DS6NYirOXQf8qYieSVMvqNwuOlgAbM_E/view


Reviews

Review #1

  • Please describe the contribution of the paper
    1. The paper introduces a bilinear pooling-assisted sparse attention block and embeds it into a transformer network to capture the fine-grained visual differences that exist between radiographic images. It can explore higher-order interactions between the input single-modal (in the encoder) or multi-modal (in the decoder) features, resulting in a more robust representational capacity of the output attended features.
    2. The paper proposes a medical concepts generation network to provide enriched semantic information that benefits radiographic report generation.
    3. The paper extensively validates the model on MIMIC-CXR, the recently released largest dataset.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The paper injects bilinear pooling into self-attention to capture second- or even higher-order interactions of the input fine-grained visual features.
    2. To record historical information, the paper extends the set of keys and values with additional "memory slots" that encode and collect features from all previous processes. The paper also uses ReLU instead of a softmax unit to prune out all negative scores of low query-key relevance, automatically ensuring the sparsity of the attention weights (a minimal illustrative sketch follows).
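    A minimal sketch of how memory slots and ReLU-based sparsity could be combined in an attention block; it uses a plain scaled dot-product backbone for brevity (the paper itself builds on bilinear pooling), and the memory size, dimensions, and renormalisation are illustrative assumptions rather than the authors' exact settings:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MemoryAugmentedSparseAttention(nn.Module):
        """Illustrative only: dot-product attention with learnable memory slots
        appended to the keys/values and ReLU (instead of softmax) for sparsity."""

        def __init__(self, d_model=512, n_memory=40):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            # Learnable memory slots shared across the batch (sizes are assumptions).
            self.mem_k = nn.Parameter(torch.randn(n_memory, d_model) * d_model ** -0.5)
            self.mem_v = nn.Parameter(torch.randn(n_memory, d_model) * d_model ** -0.5)
            self.scale = d_model ** -0.5

        def forward(self, x):  # x: (batch, n_tokens, d_model)
            b = x.size(0)
            q = self.q_proj(x)
            # Memory slots extend the sets of keys and values.
            k = torch.cat([self.k_proj(x), self.mem_k.unsqueeze(0).expand(b, -1, -1)], dim=1)
            v = torch.cat([self.v_proj(x), self.mem_v.unsqueeze(0).expand(b, -1, -1)], dim=1)
            scores = torch.matmul(q, k.transpose(1, 2)) * self.scale
            # ReLU prunes negative (low query-key relevance) scores, giving sparse weights.
            weights = F.relu(scores)
            weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-6)  # renormalise
            return torch.matmul(weights, v)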

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The article only assessed language fluency, not clinical accuracy, so we cannot judge whether the method has clinical significance.
    2. Is the CLIP module used when testing the model? If it is, where does the report that matches the image come from; if it is not, how are the regional features of the image extracted?
    3. Sections 2.2 and 2.3 contain the same Attention() formulation, but the results are different. The loss function in Section 2.3 does not explain how p_k is obtained.
    4. The framework proposed in the paper does not correspond well to the formulas in the article. The inputs and outputs of the modules in the framework are not clear, and the modules are not labeled with what they do.
    5. The MSA module in Section 2.1 is written very confusingly, and it is difficult to understand without having read references [13, 19].

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper did not provide the code, and part of the description is unclear, so it is not easy to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1. You could draw the framework diagram more carefully and clearly mark what each module does.
    2. When writing formulas, you could display them separately instead of mixing them with the running text.
    3. You should clearly describe the working mode of the model in the testing phase.
    4. You should explain the working process of the MSA module in more detail.
    5. You should evaluate the content of the generated reports for clinical accuracy.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces the image-text matching task (CLIP) for extracting regional image features, and the addition of bilinear pooling improves the modelling of higher-order interactions between fine-grained image features.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors will release the code. The authors explain the evaluation criteria for clinical accuracy and give the corresponding experimental results. The authors explain the working process of the CLIP module: in the test phase, only the image encoder of the pretrained CLIP module is used to extract the visual features from the input images (no matched reports are used). The authors also explain why the IU-Xray dataset is not used: since IU-Xray has no official training-test partition, and the obscure random partitions used in previous papers could bias the evaluation, they did not use it as the basis for comparison in their paper.



Review #2

  • Please describe the contribution of the paper

    First, this paper proposes a memory-augmented sparse attention block to capture the higher-order interactions between the input fine-grained image features. In addition, a novel Medical Concepts Generation Network is proposed to predict fine-grained semantic concepts and incorporate them into the report generation process as guidance. Finally, the method outperforms multiple state-of-the-art methods on MIMIC-CXR.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The Memory-augmented Sparse Attention module combines the memory mechanism and a squeeze-excitation operation to capture higher-order interactions.
    2. CLIP is used to extract image features, which has not been used in this task before.
    3. The performance of the method is good.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. Fig. 1 is not clear. The meaning of the module near the input image is hard to follow because its inputs and outputs (e.g., Q, K, V) are not clear.
    2. Part of the Memory-augmented Sparse Attention is similar to [1], but this is not mentioned in the paper.
    3. A medical concepts generation network is commonly used in this task, e.g., in [2], although the supervising words are not the same.
    4. Experiments on the IU X-Ray dataset, which is commonly used in this task, are lacking.

    [1] Meshed-Memory Transformer for Image Captioning, CVPR 2020. [2] Visual-Textual Attentive Semantic Consistency for Medical Report Generation, ICCV 2021.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is good, because the authors state that the source code and pre-trained models will be made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Some weaknesses are listed in question 5. The figure may need revision, and the paper needs some explanation of how it differs from similar prior work. Experiments on IU X-Ray are lacking.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The performance of the paper is good, and the problems the paper aims to solve are clear. However, some modules are similar to previous works, so the paper needs to explain the differences. The experimental part is insufficient.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors propose a medical report generation model using various components such as sparse nonlinear attention in the transformer, pseudo-medical concepts from RadGraph, and reinforcement learning. This study was evaluated on the MIMIC-CXR dataset, and the importance of each component was justified both through reasoning grounded in the literature and through experimentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The motivation for using higher-order interactions to extract fine-grained detail in X-ray images for report generation was well justified and well verified through experimentation. Coupled with sparse attention, it was also able to improve efficiency.

    The proposed MSA module having memory slots with sparse attention was able to perform significantly better than vanilla self-attention.

    The medical concepts from RadGraph enhance performance by providing semantic information.

    Thorough comparison with the state of the art and ablation of the various components introduced in the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I found it very hard to follow the notation used for defining the various variables and network weights. It would be helpful to include a chart for them in the supplementary material.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset is publicly available. Authors have said that source code and pre-trained models will be made available to the public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is not clear whether the metric scores increase when using RL because of training for 20 more epochs or actually because of RL. For a better comparison, please provide scores for the model trained without RL for 60 + 20 epochs.

    D_c and n_m are not explained when they are defined in Section 2.1.

    "mathbfVc" in Section 2.4 is a LaTeX typo (presumably \mathbf{V}_c); please fix it.

    The organization and the notation for the various variables are not well explained and are hard to follow; this needs to be fixed.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes an elegant way to solve various problems pertaining to image captioning in medical imaging, where images usually look very similar. Using higher-order interactions for fine details, coupled with memory slots, improves the representation capabilities for report generation. Further, the use of pseudo-medical concepts and RL helps push the performance further.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper attempts to automatically generate a report from chest X-rays. There are many approaches to this problem in the literature, based either on transformer-like methods or on finding-guided retrieval from previous reports. In the former approach, usually a set of text labels is generated from the images and then used to seed text generation. In the current method, the predicted text labels come from a vocabulary formed from RadGraph, and label prediction uses the CLIP approach, which has been pre-trained on large image-text pairs from general images (not necessarily medical images or chest X-rays). The main problem with all these approaches is ensuring the clinical validity of findings and maintaining a text production style close to the radiologists' way of describing them. It is not clear which fine-grained findings are covered among those that were generated in earlier approaches. The overall BLEU and other scores are still not as good as previous work based on document retrieval-type approaches, which is not even cited in the current work: https://dl.acm.org/doi/abs/10.1007/978-3-030-59713-9_54

    Please address other comments raised by reviewers in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We thank the Meta Reviewer and all reviewers. We will revise our figures and presentation (e.g., more details of MSA) accordingly.

Meta Reviewer: Q1: About the performance against the referenced paper (MICCAI 2020). We would like to highlight the following. 1) The result reported in MICCAI 2020 is not strictly comparable with ours. That paper used an in-house dataset composed of samples drawn from MIMIC-CXR, IU-Xray and ChestX-ray8. Moreover, the methods it compared against (Vis-Att, MM-Att, KERP, Co-Att, etc.) were evaluated on different datasets such as IU-Xray and IU-RR, with obscure training-test partitions. In contrast, all results reported in our paper are consistently on the largest MIMIC-CXR dataset with the official training-test partition, making them strictly comparable. 2) MICCAI 2020 and most of its comparison methods did not release code, so we could not test them on MIMIC-CXR. We have included more up-to-date methods for comparison. 3) We will release our code.

Reviewer #1: Q2: About the clinical accuracy evaluation. For clinical correctness, we follow [A1-A2] to measure Keyword Accuracy (KA). KA is the ratio of the number of correctly generated keywords to the number of all keywords in the ground-truth findings. Since [A1-A2] did not release their keywords or selection criteria, we constructed a keyword dictionary with 768 keywords based on RadGraph's high-frequency entities. RadGraph contains carefully selected entities annotated by three board-certified radiologists, including Anatomy entities such as "lung" and Observation entities such as "effusion", which carry clinical information. We obtain KA scores of 0.895 for ours, 0.864 for R2Gen, 0.869 for R2GenCMN, and 0.882 for Self-boost.
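As a concrete illustration of the KA metric as described above, a minimal sketch (the keyword set and whitespace tokenisation here are assumptions, not the exact implementation):

def keyword_accuracy(generated: str, reference: str, keywords: set) -> float:
    """Ratio of ground-truth keywords that also appear in the generated report."""
    gen_tokens = set(generated.lower().split())
    ref_keywords = [w for w in reference.lower().split() if w in keywords]
    if not ref_keywords:
        return 1.0  # nothing to recover from the ground truth
    hits = sum(1 for w in ref_keywords if w in gen_tokens)
    return hits / len(ref_keywords)

# Hypothetical keyword dictionary (the paper uses 768 high-frequency RadGraph entities).
keywords = {"lung", "effusion", "pneumothorax", "cardiomegaly"}
print(keyword_accuracy("no pleural effusion or pneumothorax is seen",
                       "there is a small left pleural effusion", keywords))  # -> 1.0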

Q3: Working mode of the model (e.g., CLIP [4]) in the test phase. In the test phase, only the image encoder of the pretrained CLIP module ("ViT-B/16") is used to extract the visual features from the input images (no matched reports are used). It is based on a vision transformer, which divides the input image (224×224) into 196 patches of size 16×16, allowing regional features to be used by our model.
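A minimal sketch of this feature-extraction step using the Hugging Face implementation of CLIP ViT-B/16; the file name is hypothetical, and this is not necessarily the authors' exact pipeline:

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()

image = Image.open("chest_xray.png").convert("RGB")                  # hypothetical file
pixels = processor(images=image, return_tensors="pt").pixel_values   # (1, 3, 224, 224)

with torch.no_grad():
    hidden = encoder(pixel_values=pixels).last_hidden_state          # (1, 197, 768)

patch_features = hidden[:, 1:, :]  # drop the [CLS] token -> (1, 196, 768) regional features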

Reviewer #2: Q4: Difference from the referenced works [1] (CVPR 2020) and [2] (ICCV 2021). The work [1] is referred to as M2Transformer [6] in our paper. [1] and our work are only conceptually similar in using memory in attention. However, they focus on different attention mechanisms (self-attention in [1] versus bilinear pooling based attention in ours), and ours further introduces sparsity into the proposed attention. Our model outperformed [1] in Table 1. As for the medical concepts generation network, our model extracts fine-grained (768) medical concepts via RadGraph to ensure semantic consistency, while [2] still employed relatively sparse concepts (18 diseases and 32 description patterns), as in conventional methods.

Q5: Experiments on IU-Xray. Since IU-Xray has no official training-test partition, and the obscure random partitions used in previous papers could bias the evaluation, we did not use it as the basis for comparison in our paper. As suggested, we tested our model on IU-Xray and achieved (Bleu4/Rouge/Meteor/CIDEr) results of (0.172/0.364/0.190/0.621), comparable to the SOTA method R2Gen (0.165/0.371/0.187/0.575) under the same partition. Also, ours shows more of an advantage on larger datasets; e.g., we additionally tested on CANDID-PTX (19237 samples) and obtained (0.139/0.311/0.149/0.385), against (0.133/0.302/0.144/0.339) for R2Gen.

Reviewer #3: Q6: The gain of RL. As suggested, we trained our model without RL for 60+20 epochs and achieved (Bleu4/Rouge/Meteor/CIDEr) of (0.119/0.282/0.142/0.296), lower than ours using RL for 20 epochs (0.136/0.298/0.170/0.429), showing that the benefit is brought by RL rather than by longer training.

[A1] Multimodal recurrent model with attention for automated radiology report generation. MICCAI’18.

[A2] Multi-Attention and Incorporating Background Information Model for Chest X-Ray Image Report Generation. IEEE Access’19.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has not addressed the fundamental question of clinical validity (BLEU scores do not indicate it), nor has it provided a good rationale for the approach being superior to document retrieval approaches.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Given the reviewer support, and subject to thorough proofreading, I vote to accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have provided a reasonable rebuttal addressing the reviewers' concerns. They are encouraged to address those points in the final paper and conduct a full proofreading pass.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7


