
Authors

Chantal Pellegrini, Matthias Keicher, Ege Özsoy, Nassir Navab

Abstract

Radiology reporting is a crucial part of the communication between radiologists and other medical professionals, but it can be time-consuming and error-prone. One approach to alleviate this is structured reporting, which saves time and enables a more accurate evaluation than free-text reports. However, there is limited research on automating structured reporting, and no public benchmark is available for evaluating and comparing different methods. To close this gap, we introduce Rad-ReStruct, a new benchmark dataset that provides fine-grained, hierarchically ordered annotations in the form of structured reports for X-Ray images. We model the structured reporting task as hierarchical visual question answering (VQA) and propose hi-VQA, a novel method that considers prior context in the form of previously asked questions and answers for populating a structured radiology report. Our experiments show that hi-VQA achieves competitive performance to the state-of-the-art on the medical VQA benchmark VQARad while performing best among methods without domain-specific vision-language pretraining and provides a strong baseline on Rad-ReStruct. Our work represents a significant step towards the automated population of structured radiology reports and provides a valuable first benchmark for future research in this area. We will make all annotations and our code for annotation generation, model evaluation, and training publicly available upon acceptance.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43904-9_40

SharedIt: https://rdcu.be/dnwHl

Link to the code repository

https://github.com/ChantalMP/Rad-ReStruct

Link to the dataset(s)

https://osf.io/89kps/

https://github.com/ChantalMP/Rad-ReStruct


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces the first structured radiology reporting benchmark dataset Rad-ReStruct and a novel method called hi-VQA for automating structured reporting. The proposed method contributes to the development of automated structured radiology report population methods, while allowing an accurate and multi-level evaluation of clinical correctness and fostering fine-grained, in-depth radiological image understanding.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Original way to use data: The Rad-ReStruct dataset provides fine-grained, hierarchically ordered annotations in the form of structured reports for X-Ray images. This approach allows for a more accurate evaluation of clinical correctness at different levels of granularity, focusing on levels with greater clinical importance.
    2. Novel formulation: The paper proposes a novel method for automating structured reporting, called hi-VQA, which leverages history context for multi-question and multi-level tasks. This approach lets the system consider prior context in the form of previously asked questions and answers, improving accuracy and efficiency in radiology reporting.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of comparison to prior work on the new benchmark: The paper only establishes a baseline on the new benchmark it introduces and does not compare against prior work on it.
    2. Limited dataset comparison: The paper only compares hi-VQA to other state-of-the-art methods on the VQA-Rad dataset. It does not provide a comprehensive comparison to all relevant prior work on other related datasets.
    3. Lower accuracy than current SOTA: Although the proposed method is novel, it does not beat M3AE in accuracy on VQA-Rad.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Add a comparison to prior work on Rad-ReStruct and on more related datasets. Improve accuracy to a higher level, at least matching the current SOTA.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall workload, the quality and quantity of its comparisons, and whether the results are SOTA.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a benchmark dataset, Rad-ReStruct, that provides fine-grained, hierarchically ordered annotations in the form of structured reports, which can benefit further studies developing more helpful reporting systems. This paper provides a hi-VQA architecture that considers prior context in the form of previously asked questions and answers for populating a structured radiology report.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper provides the first structured radiology reporting benchmark dataset that provides fine-grained, hierarchically ordered annotations.
    2. This paper presents a valuable hi-VQA architecture that considers prior context in the form of previously asked questions and answers for populating a structured radiology report, which can allow for an interactive workflow as well as improve the explainability.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The Rad-ReStruct dataset only uses medical images from the small-scale IU-Xray dataset with modification of the corresponding text contents. To establish a new dataset, it would be necessary to consider more data from multiple chest X-ray datasets, including the widely used MIMIC-CXR.
    2. Structured reporting may be a good attempt, but the original free-text reports in IU-Xray dataset often use diverse expressions to record a large amount of complex pathological information. Therefore, relying on a limited number of questions may sometimes not fully reflect the detailed pathological content, inevitably resulting in the loss of some pathological information during training.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method proposed in this article is relatively clear and has good reproducibility. The main challenge is that the dataset construction approach has relatively poor transferability and relies heavily on the medical questions defined for small-scale medical scenarios.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The contributions of this paper can be summarized as two aspects: 1) a benchmark dataset for structured medical report generation; 2) a hierarchical VQA model with memory of history questions. However, there are also some drawbacks:

    1. The dataset is mainly based on a small-scale dataset IU-Xray in the field, ignoring the widely used MIMIC-CXR dataset;
    2. The proposed VQA model is relatively ordinary. When a question has many prerequisite questions, the impact of the visual information and of the current question on the answer can easily be diluted. More feature fusion methods, such as pre-aggregating historical features with attention mechanisms [1] or using special memory units [2], may be beneficial for integrating historical information.

    [1] Fenglin Liu, et al. Exploring and distilling posterior and prior knowledge for radiology report generation. CVPR 2021.
    [2] Zhihong Chen, et al. Generating radiology reports via memory-driven transformer. EMNLP 2020.
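The attention-based history pre-aggregation the reviewer suggests could be sketched roughly as below. This is a minimal numpy illustration, not the paper's method: the function names, shapes, and scaled dot-product formulation are our assumptions about what such a fusion module might look like.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_history(query: np.ndarray, history: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention pooling of history features.

    query:   (D,)   embedding of the current question
    history: (T, D) embeddings of the T previous question-answer pairs
    returns: (D,)   attention-weighted summary of the history
    """
    d = query.shape[0]
    scores = history @ query / np.sqrt(d)   # (T,) relevance of each QA pair
    weights = softmax(scores)               # (T,) sums to 1
    return weights @ history                # (D,) weighted combination

# Example: pool 5 history embeddings of dimension 768 into one vector
# that could be fused with the current question instead of the raw history.
rng = np.random.default_rng(0)
q = rng.standard_normal(768)
h = rng.standard_normal((5, 768))
summary = aggregate_history(q, h)
```

The pooled `summary` vector keeps the input sequence short regardless of how many prerequisite questions precede the current one, which is the point of the reviewer's suggestion.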

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a benchmark dataset for structured report generation, which can contribute to future research in the field of medical report generation.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The feedback addresses most of the issues; I maintain the original rating.



Review #3

  • Please describe the contribution of the paper

    This study introduces Rad-ReStruct, a novel benchmark dataset featuring fine-grained, hierarchically ordered annotations for X-ray images in the form of structured reports. The authors propose a new approach, hi-VQA, which models the structured reporting task as hierarchical visual question answering (VQA) and takes into account prior context from previously asked questions and answers to generate structured radiology reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Introduction of a new benchmark dataset: The authors present Rad-ReStruct, a valuable benchmark dataset that provides fine-grained, hierarchically ordered annotations in the form of structured reports for X-ray images. This dataset addresses the current gap in the research area and facilitates the evaluation and comparison of different methods for automating structured radiology reporting.

    2. Novel methodology: The proposed hi-VQA method models the structured reporting task as hierarchical visual question answering (VQA), considering prior context in the form of previously asked questions and answers. This innovative approach has the potential to improve the automated population of structured radiology reports.

    3. Strong evaluation: The authors conduct comprehensive experiments to demonstrate the effectiveness of hi-VQA, showing competitive performance on the medical VQA benchmark VQARad and providing a strong baseline on the newly introduced Rad-ReStruct dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Details of the algorithm are not very clear.
    (1) Inference and hierarchy handling: The authors should provide a more comprehensive description of how the model handles hierarchical VQA tasks, particularly how the inference process is interrupted and sub-questions are answered to ensure consistency and explainability in the predictions.
    (2) Training and evaluation details: The explanation of the teacher forcing technique, cross-entropy loss optimization, and evaluation method could be more detailed, with specific examples or illustrations to help readers better grasp the concepts and their significance in the overall methodology.
    (3) Ambiguity in feature encoding: The description of the feature encoding process for fusing image and text features could be improved with more explicit explanations of the steps involved, including how the various tokens and embeddings are combined and processed within the transformer layer. For example, the shape of all the feature vectors in the hi-VQA framework could be shown.
    (4) The current presentation of Figure 2 could benefit from improvements in clarity and visual appeal, as it is difficult for readers to discern the hierarchical medical VQA process from the illustration. Enhancing the figure with additional details would be helpful; specifically, explaining the notation used, such as “H_Q,” would assist readers in understanding the figure’s components and their roles in the hierarchical medical VQA process.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I think this paper can be reproduced if the algorithm is made publicly available. It would be better to give more details in the algorithm description.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    Please see the weaknesses listed above.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty and significance: The paper introduces a new benchmark dataset (Rad-ReStruct) and proposes a novel method (hi-VQA) for hierarchical visual question answering in radiology reporting. These contributions have the potential to advance the field and improve the automated population of structured radiology reports.

    Methodological clarity: Ensuring that the methodology is clearly described and well-explained will help readers understand the approach and appreciate its effectiveness.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper introduces Rad-ReStruct, a novel benchmark dataset with fine-grained, hierarchically ordered annotations for X-ray images in the form of structured reports. The proposed hi-VQA method models the structured reporting task as hierarchical visual question answering (VQA) and incorporates prior context from previously asked questions and answers. The strengths of the paper include the introduction of the benchmark dataset and the comprehensive evaluation showing competitive performance. However, weaknesses include the lack of clarity in algorithm details, particularly in inference and hierarchy handling, training and evaluation processes, feature encoding, as well as the potentially poor generalization to open-ended settings that reflect detailed pathological content. Overall, the paper contributes to the field and is relevant to the clinical session, but clarifications are needed in addressing the weaknesses.




Author Feedback

We thank all reviewers for their constructive and valuable feedback and for acknowledging the value of our work in providing the first structured radiology reporting dataset modeled as a VQA task, Rad-ReStruct, which facilitates the development of structured reporting methods. They also recognized our method’s novelty (R1,R3) and its valuable capacity to incorporate history (R1,R2,R3), enabling interactivity and explainability (R2).

The primary motivation is to provide a public benchmark for structured reporting to foster research and comparability. Further, we set a baseline on Rad-ReStruct with our novel and effective hi-VQA method. As we introduce a new hierarchical VQA task, there are no prior methods for comparison (R1); thus, we demonstrate hi-VQA’s effectiveness on VQARad. Our evaluation on VQARad (R1) shows the benefit of our hierarchical approach, which is orthogonal to works focusing on pre-training, such as M3AE, by achieving competitive results to M3AE and improving over the no-history baseline and all other prior works. Rad-ReStruct promotes further research and comparison to the proposed baseline.

Structured reporting mitigates issues in free-text reporting, such as ambiguity and the difficulty of evaluating clinical correctness, while ensuring completeness. It is endorsed by professional radiology societies such as the RSNA and ESR and is increasingly gaining recognition in research and clinical practice [8,10,18]. Rad-ReStruct’s reports include detailed pathological content (MR), with abundant findings and attributes, making it a strong benchmark for this area. Potential limits (R2) can be tackled by creating comprehensive, tailored templates for clinical use cases, possibly with a few free-text fields for additional comments, while preserving the benefits of structured reporting.

The expert annotations in IU-XRay are essential for creating a detailed and high-quality benchmark but are unavailable for larger datasets like MIMIC-CXR (R2). We firmly believe Rad-ReStruct’s size (R2), which is an order of magnitude larger than e.g. VQARad, is sufficient to provide a much-needed first benchmark for structured reporting. While we rely on MeSH tags for construction, it is important to note that our main contribution lies in the dataset itself and not in the way it was constructed.

Our model uses a transformer with input-specific token-type IDs facilitating an informed, attention-based feature fusion (R2). Also, it must rely on visual details for precise answers (R2) as prerequisite questions offer only high-level information.

We thank R3 for the helpful writing suggestions. If desired, we will add more details to our method and Fig. 2.

Hierarchy handling: At inference, questions are answered iteratively, one by one. If a negative answer is predicted, all sub-questions are negated (set to No/No selection) and are not asked, ensuring consistency and enhancing explainability by tracking errors back to their source.

Training: We use teacher forcing in training, relying on ground-truth answers for prior questions. For the query “Is there pneumonia in the lung? Yes. What is the degree?”, the prior answer “Yes” comes from the ground truth, allowing parallel computation. For loss computation, we mask answer options unrelated to the current question and apply positive weighting per class. At inference, all answers are predicted and no ground truth is used.

Feature encoding: We merge flattened image features (Nx196x768) and padded text features (Nx259x768) into an Nx458x768 vector and assign specific token-type IDs for image, history question, history answer, and current question. We create a joint positional encoding (Nx458x768) by concatenating 1D and 2D encodings (each Nx458x384) along the feature dimension. Token-type and positional encodings are added to the feature vector.
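For illustration, the feature-encoding step described above can be sketched in numpy. Random tensors stand in for the learned embeddings and 1D/2D positional encodings, the helper names and token-type boundaries are our assumptions, and the 196 + 259 = 455 text-plus-image tokens here omit whatever special tokens make up the 458 positions reported in the rebuttal:

```python
import numpy as np

N, D = 2, 768              # batch size, embedding dimension
N_IMG, N_TXT = 196, 259    # flattened image patches, padded text tokens
rng = np.random.default_rng(0)

img_feats = rng.standard_normal((N, N_IMG, D))  # flattened image features
txt_feats = rng.standard_normal((N, N_TXT, D))  # history + current question

# Concatenate along the token axis into one joint sequence.
joint = np.concatenate([img_feats, txt_feats], axis=1)  # (N, 455, D)

# Token-type IDs mark image / history question / history answer /
# current question tokens; the boundary below is a simplified placeholder
# treating all textual tokens as one type.
token_type = np.zeros((N, joint.shape[1]), dtype=np.int64)
token_type[:, N_IMG:] = 3
type_table = rng.standard_normal((4, D))  # stands in for a learned embedding

# Joint positional encoding: concatenate a 1D and a 2D encoding,
# each of dimension D // 2, along the feature axis.
pos_1d = rng.standard_normal((N, joint.shape[1], D // 2))
pos_2d = rng.standard_normal((N, joint.shape[1], D // 2))
pos = np.concatenate([pos_1d, pos_2d], axis=-1)  # (N, tokens, D)

# Both encodings are added to the fused features before the transformer.
fused = joint + type_table[token_type] + pos
```

The `fused` tensor is what a transformer layer would then process jointly, letting attention mix image, history, and current-question tokens.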

Overall, as all reviewers acknowledged, we believe our work will add significant value for medical image analysis by introducing a novel VQA benchmark and method for structured reporting, opening new paths for research and evaluation of clinical accuracy.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Considering the authors’ rebuttal and their efforts to respond to the reviewers’ concerns, it appears that the rebuttal has adequately addressed the raised issues. While some weaknesses were identified, the novelty of the benchmark dataset and the potential value for research and evaluation of clinical accuracy are significant contributions. Therefore, I recommend accepting the paper, while suggesting that the authors incorporate the additional details provided in the rebuttal to clarify the algorithmic aspects mentioned by the reviewers.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have clarified most of the major concerns of the reviewers, specifically the algorithm details, inference and hierarchy handling, training process, and feature encoding, including the promise to add more details about the method and Figure 2. I find the proposed method interesting and relevant to clinical practice and thus suggest acceptance.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Based on a thorough evaluation of the author’s rebuttal, I recommend accepting the paper. The rebuttal successfully addresses concerns related to algorithm details and novelty, ensuring the paper’s quality. The contribution of the paper to medical image analysis is valuable, as it introduces a novel VQA benchmark and method for structured reporting. This can pave the way for further research and evaluation of clinical accuracy, thereby adding value to the field.


