
Authors

Jun Li, Shibo Li, Ying Hu, Huiren Tao

Abstract

Automatic radiology report generation is essential to computer-aided diagnosis. Building on the success of image captioning, medical report generation has become achievable. However, the lack of annotated disease labels remains a bottleneck in this area. In addition, the image-text data bias problem and complex sentences make it more difficult to generate accurate reports. To address these gaps, we present a self-guided framework (SGF), a suite of unsupervised and supervised deep learning methods that mimic the process of human learning and writing. In detail, our framework obtains domain knowledge from medical reports without extra disease labels and guides itself to extract fine-grained visual features associated with the text. Moreover, SGF successfully improves the accuracy and length of medical report generation by incorporating a similarity comparison mechanism that imitates the process of human self-improvement through comparative practice. Extensive experiments demonstrate the utility of our SGF in the majority of cases, showing its superior performance over state-of-the-art methods. Our results highlight the capacity of the proposed framework to distinguish the fine-grained visual details associated with different words and verify its advantage in generating medical reports.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_56

SharedIt: https://rdcu.be/cVVqc

Link to the code repository

https://github.com/LijunRio/A-Self-Guided-Framework

Link to the dataset(s)

https://openi.nlm.nih.gov/


Reviews

Review #1

  • Please describe the contribution of the paper

The authors present a self-guided framework that obtains potential medical knowledge from text reports without extra disease labels. It can assist the network in learning fine-grained visual details associated with the text, alleviating the data bias problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) The paper is well organized, logically clear, and easy to read. (2) The figures and tables are very clear, and detailed explanations are given. (3) The experimental section is adequate, with detailed analysis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The article relies on many mature algorithms, so the innovation is not outstanding enough. (2) How are the various parameters in the network determined? Some clarification should be given.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper does not provide a link to the code. The description in the paper is relatively clear, and the work should be generally reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

(1) Please clearly state the differences from previous studies. (2) The authors should carefully discuss for which types of image samples the method is more effective and for which samples it fails. (3) The authors should state the flaws of the method in this paper and what future work is possible.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The structure of the paper is very clear, the introduction is very detailed, and the work should be easy to reproduce.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

• The paper proposes to use labels from unsupervised clustering of radiology reports to learn a visual feature extractor for medical images. This approach alleviates the need for supervised image-level labels to train the visual feature extractor.
    • The learned visual features are then used in a transformer-based caption generator to write the corresponding report.
    • The report generator further uses a supervised cosine similarity loss to compare the generated and ground-truth reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

• Unsupervised clustering acts as knowledge extraction from a domain-specific knowledge source, i.e., radiology reports.
    • The reuse of the report embeddings extracted from Sentence-BERT in the report generator, through a supervised cosine similarity loss, is interesting (see the sketch after this list).
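
    For concreteness, a minimal sketch of what such a loss term could look like; the encoder choice, tensor names, and mean reduction are my assumptions, not necessarily the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def report_similarity_loss(gen_emb: torch.Tensor,
                           gt_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity loss between the embeddings of a generated
    report and its ground-truth report (hypothetical sketch).

    Both inputs are (batch, dim) sentence embeddings, e.g. from
    Sentence-BERT; the loss reaches 0 when the embeddings align.
    """
    cos = F.cosine_similarity(gen_emb, gt_emb, dim=-1)  # (batch,)
    return (1.0 - cos).mean()

# Usage with random stand-in embeddings (384-d, as in small SBERT models):
loss = report_similarity_loss(torch.randn(4, 384), torch.randn(4, 384))
```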

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

• The paper lacks motivation for why the proposed approach is well suited to medical report generation. I understand the approach doesn't require image-level disease labels, but the main building blocks of the model are not necessarily trained for the medical domain; they are mostly pre-trained on natural images and simply applied to medical data.
    • The experiments don't evaluate performance on preserving negative mentions. A major differentiator between medical report generation and image captioning on natural images is the "negative mention" in medical reports. Two sentences can have a high similarity score (calculated based on word overlap) while having opposite polarity, for example "Lungs have pleural effusion" vs. "Lungs have no pleural effusion" (see the illustration after this list).
    • The claim to "extract fine-grained visual features associated with text" is supported only by the qualitative experiments in Figure 3. That is not enough to support the claim.
    • The heatmaps in Figure 3 are not localized and cover most of the image. For "no", the heatmap highlights the upper lobe. It is not clear how this is a good result: the "no" in the sentence is in context with pleural effusion and pneumothorax, and both diseases lie in the lower lobe regions, as highlighted in the last heatmap.
    • The method section reads more like putting different blocks together and lacks motivation for the need and design of those blocks. For example, the knowledge distiller is required for unsupervised clustering of the reports, but why is it designed around BERT-based embeddings? Are these embeddings better than other methods for medical reports? Would fine-tuning BERT on medical reports help obtain better domain-specific embeddings? Why was dimensionality reduction used? Why not cluster the reports using the report embeddings directly? Is there some transformation that UMAP can learn but BERT cannot? Why was HDBSCAN clustering used? How is the number of clusters defined? How is it ensured that the clusters are distinct enough?
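
    The polarity issue is easy to demonstrate: any overlap-based score rates the two example sentences above as near-identical despite their opposite clinical meaning. A self-contained illustration, where the scoring function is a crude unigram stand-in for BLEU rather than the paper's exact metric:

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a crude stand-in for BLEU-style scoring."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

# Opposite clinical meaning, yet a near-perfect overlap score (~0.89):
print(unigram_f1("Lungs have pleural effusion",
                 "Lungs have no pleural effusion"))
```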

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The paper will be reproducible if the authors release their code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

• It is not clear what "data bias problem" the authors are referring to throughout the paper.
    • In the knowledge distiller (KD), the dimension reduction maps the report embeddings to a 2-D space. It is not clear why the authors mapped to a 2-dimensional space. Are the UMAP embeddings used in some 2-D visualization of the report clusters? If the final goal is unsupervised clustering, the authors should consider hyper-parameter tuning to evaluate the best embedding dimension before clustering (see the sketch after this list).
    • The experiments don't evaluate the quality of the clusters extracted by knowledge clustering. Figure 3(c) is too small to draw any meaningful conclusion from.
    • At multiple points, the manuscript lacks proper detail and makes very generic statements:
      o "Recently, some works [7-9] have been proposed for more specific tasks." Which tasks?
      o "Researchers have made a lot of attempts and efforts to fill these gaps." Which gaps?
      o "SB is a multi-layer bidirectional transformer that has been well pre-trained on two large and widely covered corpora [19, 20]." Which corpora?
      o "One reasonable explanation is that our method ignores some meaningless words and pays more attention to the long phrases used to describe diseases." Which meaningless words? Please provide examples.
      o "KMVE can help the framework to distinguish image details and alleviate the data bias problem." It is not clear how.
    • In the ablation study, how are image features calculated for the base transformer without KMVE?
    • The text reads: "For the meaningless words like 'the', 'are' and '.', our model assigns relatively low probabilities." Yet in Figure 3(a), the probability of "." is over 0.9.
    • The paper is overloaded with abbreviations, which makes it difficult to read.
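
    For the embedding-dimension point, a hedged sketch of the kind of sweep I have in mind; the libraries (umap-learn, hdbscan, scikit-learn) are real, while the data and parameter values are placeholders:

```python
import numpy as np
import umap       # umap-learn
import hdbscan
from sklearn.metrics import silhouette_score

embeddings = np.random.rand(500, 768)  # placeholder for real report embeddings

# Sweep candidate UMAP output dimensions and score the resulting clusters.
for n_components in (2, 5, 10, 50):
    reduced = umap.UMAP(n_components=n_components, n_neighbors=15,
                        min_dist=0.0, random_state=42).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
    mask = labels >= 0  # exclude HDBSCAN noise points (label -1)
    if mask.sum() > 2 and len(set(labels[mask])) > 1:
        print(n_components, silhouette_score(reduced[mask], labels[mask]))
    else:
        print(n_components, "no clusters found on this placeholder data")
```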

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is novel in terms of its pipeline and application.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

Novel use of mixed image- and text-based strategies to build robust models that can generate reports without the need for tedious pixel-level annotation and manual image captioning.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is written methodically, provides a step-by-step description of the various stages of model development, and demonstrates validation with the help of visual representations (Fig. 3, heatmaps), which allow easy correlation between the observed visual features and the corresponding generated text.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The self-guided framework is a novel and useful approach to overcoming some of the described challenges; however, a few questions remain.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors describe the model development methodology in detail and report ablation study results. It would be ideal if access could be provided to the code or tools used for the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

1) How does the similarity comparer in RG overcome the image-text data bias? If one image can be reported in numerous ways by different radiologists, which report are the model-generated reports compared to?
    2) The three-step process (KD, KMVE, and RG) should work well for a given imaging modality and a specific lesion detection or classification task. Would it generalize across modalities or transfer to new tasks? To what extent would the model need retraining?
    3) Do the number of neighbors and minimum distance set in the dimension reduction have any bearing on why BASE+SC performs better than SGF on BLEU-1? Would setting a larger number of neighbors allow KMVE to capture semantic relationships between distant words?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Novel mechanism of using a mix of reports and images for model training, and of using the similarity comparer as a QA step (alleviating the need for human labelling or QA).

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The reviewers noted several novel contributions, such as using unsupervised clustering of radiology reports to perform knowledge extraction and to learn a visual feature extractor, and a supervised cosine similarity loss. The reviewers also pointed out several weaknesses, such as the need for clarification on how network parameters are determined, the lack of discussion of failure cases, weak motivation, and the lack of ablation study/justification for the chosen blocks.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We are encouraged that R2 and R3 find our idea novel and useful, and we are glad that R1 and R3 found our writing logical and methodical. Below, we clarify the reviewers' key concerns.

Main concerns

⑴ Bad-case analysis [R1, R3, MR]: Thanks for pointing out the lack of negative-sample analysis. We will add more negative samples in the camera-ready version.

⑵ Flaws and future work [R1, R2, R3]: We propose a framework that generates better reports by learning potential knowledge labels, but these labels may not align with human intuition, so making report generation more explainable is a topic worth diving into. Besides, report generation involves "negative mentions": measuring report quality only by word overlap may not be appropriate, so designing a metric that is more sensitive to negative samples is another research topic. Finally, we have only tested our framework on X-rays; exploring its performance on data from other modalities is also an interesting direction.

⑶ Parameter selection and its motivation [R1, R2, R3]: When designing the framework, we focused on solving the image-text data bias problem to generate better reports, rather than on the text clustering task itself. We therefore chose common methods that require few parameters to fine-tune in the text clustering step. We explain the main concerns below.
① Why a pre-trained BERT-based method? In many NLP downstream tasks, BERT-based models trained on huge corpora have been shown to generalize well. Additionally, the design of Sentence-BERT is consistent with our purpose of comparing the similarity between sentences.
② Why reduce the dimension / map to a 2-D space? The embedded vectors are very high-dimensional, which is computationally expensive, so the usual approach is to reduce the dimensionality before clustering. After testing several common dimension-reduction settings (2, 5, 10), the 2-D results were more stable and easy to visualize.
③ Why UMAP and HDBSCAN? UMAP and HDBSCAN are commonly used in text clustering. HDBSCAN does not need the number of clusters specified in advance and can cluster dense datasets; moreover, only two of its parameters need fine-tuning (a minimal sketch of this pipeline is given at the end of this response). The similarity matrix in Fig. 3(c) is close to a diagonal matrix, which indicates that the clusters are distinct. Due to space limitations, Fig. 3(c) is scaled down and can be inspected by zooming in.

Individual concerns

To R2: Thanks! The motivation, parameters, and "negative mention" issues are explained above. Due to space limitations, we address some key questions. ① Data-bias problem: this is explained in the 2nd paragraph of the Introduction. ② Heatmaps not localized: the heatmaps are not localized enough because KMVE uses a CNN as the extractor; such methods are indeed more dispersed when visualizing heatmaps than methods that directly convert the image into patches. ③ How is training done without KMVE? As stated in Table 2, KMVE and SC denote losses; without KMVE, a pre-trained ResNet50 backbone is used to extract image features. ④ The "." score is over 0.9: this score is not aligned with human intuition, hence our wording "relatively". Making report generation more explainable is a further research topic. ⑤ Many generic statements: thanks for pointing out that problem; we will revise them in the subsequent version.

To R3: Thanks! ① The SC module in RG makes the predicted text closer to the real report; the image-text data bias problem is mainly alleviated by the KMVE loss. If multiple reports existed per image, our method would likely tend to predict the sentences common across them; since each image in our dataset corresponds to a single report, this hypothesis would need more datasets to verify. ② Theoretically, our method can be applied to different modalities and disease types, but the framework must be retrained to adapt to specific downstream tasks. ③ No, BASE+SC does not consider the KMVE loss; it directly uses a pre-trained ResNet50 to extract features.
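
For concreteness, the three steps described above (Sentence-BERT embedding, UMAP reduction to 2-D, HDBSCAN clustering) map onto off-the-shelf libraries; a minimal sketch under those assumptions, where the model name, parameter values, and toy corpus are illustrative rather than our exact configuration:

```python
import numpy as np
import umap       # umap-learn
import hdbscan
from sentence_transformers import SentenceTransformer

# Toy corpus: repeated placeholder reports standing in for a real dataset.
reports = ["no acute cardiopulmonary abnormality.",
           "small left pleural effusion.",
           "heart size is normal. lungs are clear."] * 50

# 1. Sentence-BERT report embeddings (pre-trained; no disease labels needed).
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(reports)
# Jitter the duplicated toy embeddings so the corpus is not degenerate.
embeddings = embeddings + np.random.normal(0, 0.01, embeddings.shape)

# 2. UMAP reduction to 2-D before clustering.
reduced = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.0,
                    random_state=42).fit_transform(embeddings)

# 3. HDBSCAN: no preset cluster count; noise points are labelled -1.
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(reduced)
# `labels` then serves as the pseudo disease labels that guide
# visual-feature learning in the KMVE step.
```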

Thanks to all reviewers!


