Authors
Zhihao Chen, Yang Zhou, Anh Tran, Junting Zhao, Liang Wan, Gideon Su Kai Ooi, Lionel Tim-Ee Cheng, Choon Hua Thng, Xinxing Xu, Yong Liu, Huazhu Fu
Abstract
Medical phrase grounding (MPG) aims to locate the most relevant region in a medical image, given a phrase query describing certain medical findings, which is an important task for medical image analysis and radiological diagnosis. Existing visual grounding methods rely on general visual features for identifying objects in natural images and fail to take subtle and specialized features of medical findings into account, leading to sub-optimal MPG performance. In this paper, we propose MedRPG, an end-to-end approach for MPG. MedRPG is built on a lightweight vision-language transformer encoder and directly predicts the box coordinates of mentioned medical findings, which can be trained with limited medical data. To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo). TaCo seeks context alignment to pull both the features and attention outputs of relevant region-phrase pairs close together while pushing those of irrelevant regions far away, such that the final box prediction depends more on its finding-specific regions and phrases. Experimental results on three MPG datasets demonstrate that our MedRPG outperforms state-of-the-art visual grounding approaches by a large margin, and the proposed TaCo strategy is effective in enhancing finding localization ability and reducing spurious region-phrase correlations.
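For readers unfamiliar with contrastive region-phrase alignment, the sketch below illustrates an InfoNCE-style objective in the spirit of the TaCo strategy described above. It is a minimal, hedged illustration: the tensor names (region_feats, phrase_feats) and the single-level, feature-only formulation are our assumptions, not the authors' implementation, which additionally aligns attention outputs.
```python
import torch
import torch.nn.functional as F

def region_phrase_infonce(region_feats: torch.Tensor,
                          phrase_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over paired region/phrase embeddings.

    region_feats, phrase_feats: (N, D) tensors where row i of each is a
    matched region-phrase pair; all other rows serve as negatives.
    """
    region = F.normalize(region_feats, dim=-1)
    phrase = F.normalize(phrase_feats, dim=-1)
    logits = region @ phrase.t() / temperature        # (N, N) cosine similarities
    targets = torch.arange(region.size(0), device=logits.device)
    # Pull matched pairs together, push mismatched pairs apart, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```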
Link to paper
DOI: https://doi.org/10.1007/978-3-031-43990-2_35
SharedIt: https://rdcu.be/dnwLQ
Link to the code repository
N/A
Link to the dataset(s)
N/A
Reviews
Review #2
- Please describe the contribution of the paper
The paper proposes a framework for Medical Phrase Grounding (MPG), which aims to identify the relevant regions in medical images based on textual descriptions of medical findings. The main contribution of the paper is the introduction of the Tri-attention Context Contrastive Alignment (TaCo) strategy, which enables the model to learn more representations with region-phrase correspondences using both features and attention outputs. To account for the limited annotated data, a lightweight vision-language transformer framework is used.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is overall well-prepared and easy to follow.
- Incorporating attention outputs from transformers into contrastive learning is an interesting approach. The results of the ablation study presented in Table 2 demonstrate a slight improvement with the inclusion of this component.
- A lightweight vision-language transformer framework is used to accommodate limited annotated data.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The proposed method may have very limited practical value as it depends on manually annotated image-phrase-bounding box triplets.
- The authors did not compare their method with unsupervised approaches, such as ConVIRT [1], GLoRIA [2], and BioVIL [3], which leverage large-scale image-text pairs in an unsupervised manner. Due to the abundance of unlabeled data, unsupervised methods are more practical and useful in real-world situations.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The reproducibility of the paper is credible.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
The main bottleneck in developing successful medical phrase grounding models is the limited availability of annotated datasets. The paper, unfortunately, does not address this issue. The authors should consider exploring unsupervised pre-trained vision-language transformer (VLT) models, as mentioned in the weakness section, or weakly-supervised approaches. They could then demonstrate whether their proposed approach can further improve performance by fine-tuning on a labeled dataset. Without addressing the challenge of limited annotated data, the contribution of this paper is limited.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
3
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The approach has limited practical value in clinical workflows.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
3
- [Post rebuttal] Please justify your decision
After considering the rebuttal, my decision remains unchanged. I disagree with the authors that large-scale annotation collection is a feasible approach for model development in medical image analysis. This approach is not only costly and labor-intensive but also restricts the downstream applications to a specific task, such as the Medical Phrase Grounding (MPG) discussed in this paper. The authors have not clearly explained how MPG can enhance the diagnostic process. Given that the model takes radiology reports as inputs, which already include radiologist findings, the utility of MPG seems ambiguous.
Moreover, when comparing the performance of BioVIL on the MS_CXR dataset, the results appear to be inferior to those presented in the original publication. This raises concerns regarding the implementation and the validity of the experimental details.
Review #3
- Please describe the contribution of the paper
The paper proposes a method for medical phrase grounding. The input of the method is a radiology image and text, and the output is a set of bounding boxes on the image corresponding to the text. The method uses a transformer architecture with a new tri-attention-context contrastive alignment loss.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The method of the paper is clearly defined. Figure 2 is well-designed.
- The problem of medical phrase grounding is interesting and very challenging; it is currently underexplored, and this paper makes a valuable contribution to this area.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The motivation behind some parts of the method is not clear. Are these solutions specifically designed for medical phrase grounding, and why? Or is the method directly adapted from the general phrase grounding domain?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The paper is reproducible. It is not clear whether the code will be released.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- The paper does not have enough section/subsection headers, which makes it quite hard to read. It would probably be wise to cut some of the text to make the entire paper more readable.
- Fig. 1: Adds limited value to the story in the introduction.
- Fig. 3: Why is the tag added to the images? As decoration?
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Good contribution and results
- Reviewer confidence
Confident but not absolutely certain
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Review #4
- Please describe the contribution of the paper
The paper proposes an approach for grounding textual data from medical reports to regions in the x-ray image. It formulates the problem as a region-phrase alignment problem and proposes an attention-based contrastive model to ground medical phrases. Results and ablations on 3 different datasets demonstrate the efficacy of the approach.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper proposes a novel approach to align medical phrases and regions in the x-ray image using feature-level and attention-level alignment objectives. Their joint loss function, called TaCo, is based on InfoNCE and enforces visual bounding-box features and attention to align with the corresponding phrase features and attention.
- Three datasets are used to highlight the performance of the proposed approach. The model achieves state-of-the-art results on all 3 datasets. The ablations show improvements over the VLT baseline and highlight the incremental improvements as the feature-level and attention-level contrastive alignments are added as penalties.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The ablation study is limited to the different components of the loss function. Sensitivity to hyperparameters, ablations across the amount of bounding-box data available, generalization performance when tested on held-out datasets, etc., are missing from the paper. Given that there is a page limit, this isn't a big deal, but more ablations usually make the paper more convincing.
- The difference in metrics with respect to baselines and previous methods is within 1-2% on some datasets. Having confidence intervals on some of the numbers would also help show that the numbers in the paper are robust to stochastic variability in the model training and model selection process.
- Please rate the clarity and organization of this paper
Very Good
- Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
The paper has given ample details of its implementation process and seems reproducible. The code will be released after acceptance.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- Please consider having confidence intervals on the metrics and testing on completely held-out datasets as a way to improve confidence in the approach. This is even more important where improvements over the baselines are within 1-3%.
- Showing the ablations for more datasets is another way to highlight that the improvements in performance can be attributed to the newly proposed components.
- Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making
6
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a new method for phrase grounding, is well written, and conducts experiments that show performance improvements over multiple datasets. The ablations also show that the performance improvement comes from the penalties introduced in the paper. My vote is for an accept.
- Reviewer confidence
Very confident
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
6
- [Post rebuttal] Please justify your decision
The added ablations and CIs make the results more convincing. I retain my vote for acceptance.
Primary Meta-Review
- Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The paper proposes a framework for medical phrase grounding using contrastive alignment. Reviewers find the paper well-written and easy to follow. The introduced tri-attention-context contrastive alignment loss and the incorporation of attention outputs into contrastive learning are considered interesting and novel. Experimental results and ablation studies showcase improvements over baselines. Reviewers suggest addressing the dependence on manually annotated data, the lack of comparisons with unsupervised vision-language pre-trained models, and the need for further clarity through comprehensive ablations and confidence intervals.
Author Feedback
We thank all the reviewers for their insightful and valuable comments and address all the concerns as follows:
To R#2:
Q2.1 Availability of manually annotated data.
A2.1 Collecting annotations for medical phrase grounding is feasible in practice. Given a set of medical images and reports, phrases of medical findings can be automatically extracted from the reports by off-the-shelf natural language processing tools, and the image annotation only involves drawing bounding boxes, similar to lesion detection.
We believe that investing effort in annotations for medical grounding can yield significant benefits in a clinical setting, where report-image pairs are abundant. By using our MedRPG trained on ~1k cases, a radiology department can quickly convert historical reports into a large-scale annotated local dataset (~100k-1M cases). This approach addresses the challenge of developing in-house models or evaluating off-the-shelf AI models prior to deployment using local datasets.
Q2.2 Comparisons with vision-language unsupervised pre-trained models.
A2.2 We test the unsupervised methods BioViL and GLoRIA on the grounding task. Table 1 presents their performance. While GLoRIA and BioViL achieve some reasonable results, they are much worse than our method. This suggests that unsupervised learning alone is insufficient for effective medical phrase grounding.
Table 1. Comparisons with BioViL and GLoRIA (Acc / mIoU)
Method | MS_CXR | ChestX-ray8 | In-house
BioViL | 7.78 / 19.19 | 6.56 / 12.78 | 3.65 / 13.55
GLoRIA | 28.74 / 31.17 | 8.58 / 16.39 | 5.74 / 14.91
Ours | 69.86 / 59.37 | 36.02 / 34.59 | 49.87 / 43.86
To R#3:
Q3.1 Is the proposed method specific to medical phrase grounding?
A3.1 Our method is motivated by medical phrase grounding and is not adapted from the general domain. Nevertheless, some of our solutions such as TaCo have the potential to be applied to general phrase grounding as well. Exploring the effectiveness of TaCo in the general domain could be an interesting future work.
To R#4:
Q4.1 Adding confidence intervals to demonstrate the improvements in performance.
A4.1 Thanks for the suggestion. We repeat our experiments 5 times with different random seeds and compute the 95% confidence intervals (CIs) for the avg. accuracy using the t distribution. We perform a t-test to determine the p-value between our MedRPG and the second-best method, TransVG.
MS_CXR: TransVG Acc=66.23, CI=[64.98, 67.48]; Ours Acc=69.7, CI=[68.85, 70.55], p-value=0.0002
ChestXray8: TransVG Acc=33.43, CI=[30.95, 35.92]; Ours Acc=35.96, CI=[35.68, 36.24], p-value=0.0231
In-house: TransVG Acc=47.36, CI=[44.57, 50.16]; Ours Acc=50.13, CI=[48.43, 51.83], p-value=0.0466
The narrow CIs and low p-value demonstrate that our method’s performance is robust to stochastic variability during model training and selection. In the final version, we will include CIs for all the methods being compared.
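As a concrete illustration of the statistics described in A4.1, here is a minimal Python/SciPy sketch of a 95% confidence interval from repeated runs using the t distribution and a two-sample t-test. The per-seed accuracy lists are hypothetical placeholders, not the authors' actual runs, and the pooled-variance (equal-variance) test variant is an assumption.
```python
import numpy as np
from scipy import stats

# Hypothetical per-seed accuracies (5 random seeds each); NOT the authors' raw runs.
ours    = np.array([69.2, 69.8, 70.1, 69.5, 69.9])
transvg = np.array([65.8, 66.9, 66.0, 66.4, 66.1])

def ci95(x):
    """95% confidence interval for the mean, using the t distribution."""
    mean, sem = x.mean(), stats.sem(x)           # sem uses ddof=1 by default
    half = stats.t.ppf(0.975, df=len(x) - 1) * sem
    return mean - half, mean + half

print("Ours    95% CI:", ci95(ours))
print("TransVG 95% CI:", ci95(transvg))
# Two-sample t-test (pooled variance assumed); use equal_var=False for Welch's test.
print("p-value:", stats.ttest_ind(ours, transvg).pvalue)
```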
Q4.2 More comprehensive ablation study.
A4.2 We study the impact of hyper-parameters (trade-off parameter \mu and number of negative samples K). Tables 2 and 3 show the avg. accuracy of our MedRPG method with varying hyper-parameters on the MS-CXR dataset. As can be seen, our method is not very sensitive to hyper-parameter choices.
Table 2. Ablation for \mu
\mu | Acc
0.1 | 66.86
0.05 | 69.86
0.025 | 68.86
0.01 | 68.86
Table 3. Ablation for K
K | Acc
3 | 67.66
5 | 69.86
7 | 66.67
Q4.3 Test on held-out datasets
A4.3 We assess the generalization capability of our method across different datasets by training on either the MS-CXR (termed M) or ChestXray8 (termed C) dataset and then testing on the other dataset. Table 4 presents the cross-dataset accuracy of MedRPG and TransVG. Consistently, MedRPG outperforms TransVG, highlighting the effectiveness of our approach in enhancing the performance of medical phrase grounding.
Table 4. Cross-dataset performance
Method | Train | Test | Acc
TransVG | M | C | 22.22
TransVG | C | M | 20.96
Ours | M | C | 24.24
Ours | C | M | 26.95
Post-rebuttal Meta-Reviews
Meta-review # 1 (Primary)
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The author rebuttal provides explanations and clarifications to the concerns raised by the reviewers, addressing issues such as the availability of manually annotated data, comparisons with unsupervised pre-trained models, and the need for comprehensive ablation studies. Based on the strengths of the paper, the novelty of the proposed method, and the improvements demonstrated over baselines, I recommend accepting the paper. The additional experimental results, including confidence intervals and ablation studies, contribute to the strength of the paper.
Meta-review #2
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
The paper proposes a novel method for medical phrase grounding and presents interesting experimental results. I agree with the reviewer's concerns about the limited discussion of the annotation cost and the lack of clinical evaluation of the method's usefulness. Nevertheless, this paper still has merit, and its strengths outweigh its weaknesses. I think this paper will be of interest to the MICCAI community. I would suggest 'Accept'.
Meta-review #3
- Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.
By combining information from the original reviews, the rebuttal, and my own reading of the submission, my assessment is that this paper is a valid contribution to advancing medical phrase grounding in chest X-ray diagnosis. This is an important task, and the authors provided a solution with good technical novelty. The experimental results also validate the contribution. The problem does not directly target finding pathologies in the first place, but it can be useful for training on a large collection of patient data using reports or for improving the interpretability of radiology reports for other clinicians (since the bounding box can be automatically generated, as in Figure 1). The authors do seem to understand and improve the clinical workflow.