Authors

Chenlu Zhan, Peng Peng, Hanrong Zhang, Haiyue Sun, Chunnan Shang, Tao Chen, Hongsen Wang, Gaoang Wang, Hongwei Wang

Abstract

Medical Visual Question Answering (Med-VQA) is expected to predict a convincing answer from a given medical image and clinical question, with the aim of assisting clinical decision-making. However, current works tend to rely on superficial linguistic correlations as a shortcut, which may produce unsatisfactory clinical answers. In this paper, we propose a novel DeBiasing Med-VQA model with CounterFactual training (DeBCF) to overcome language priors comprehensively. Specifically, we generate counterfactual samples by masking crucial keywords and assigning irrelevant labels, which implicitly increases the model’s sensitivity to semantic words and visual objects and thereby weakens bias. Furthermore, to explicitly prevent shortcut linguistic correlations, we formulate the language prior as a counterfactual causal effect and eliminate it from the total effect on the generated answers. Additionally, we present a new bias-sensitive Med-VQA dataset, Semantically-Labeled Knowledge-Enhanced under Changing Priors (SLAKE-CP), built by regrouping and re-splitting the train-set and test-set of SLAKE so that they have different prior distributions of answers, which encourages the model to learn interpretable objects rather than memorize biases. Experimental results on two public datasets and SLAKE-CP demonstrate that the proposed DeBCF outperforms existing state-of-the-art Med-VQA models and obtains significant improvement in accuracy and interpretability. To our knowledge, this is the first attempt to overcome language priors in Med-VQA and to construct a bias-sensitive dataset for evaluating debiasing ability.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_36

SharedIt: https://rdcu.be/dnwyP

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

The authors propose a new model named DeBiasing Med-VQA model with CounterFactual training (DeBCF) to address the issue of language bias in existing Med-VQA models. The authors also introduce a new dataset called SLAKE-CP by regrouping and re-splitting the train-set and test-set of SLAKE into different prior distributions of answers. The proposed method outperforms existing state-of-the-art Med-VQA models and improves accuracy and interpretability. To the best of my knowledge, this is the first attempt to overcome language bias in Med-VQA and to construct a bias-sensitive dataset for evaluating debiasing ability.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. This paper explores the issue of language bias in Med-VQA tasks and proposes a solution using counterfactual examples. The use of these counterfactual training samples seems to improve the model’s ability to recognize clinic objects, especially when they have a clear and well-defined causal relationship. Overall, the paper appears to address an important research problem and offer a promising solution.
2. Furthermore, the authors construct a linguistic-bias-sensitive Med-VQA dataset, SLAKE-CP. This dataset can be used as a benchmark for future research on debiased Med-VQA models. By using this dataset, researchers can evaluate the effectiveness of their debiasing techniques and compare their results with state-of-the-art models. This will help improve the quality of Med-VQA models and reduce the impact of language bias on clinical decision-making.
3. Both the quantitative and qualitative experimental results demonstrate the effectiveness of their proposed method for debiasing Med-VQA models. They have shown that their method outperforms the state-of-the-art models in terms of accuracy and interpretability. Additionally, the authors provide explanations for their results in Figure 3 and 4, which demonstrate the improvement in interpretability of their proposed method. Overall, the results indicate that the proposed method is a promising approach to mitigate the effects of language bias in Med-VQA and improve the quality of clinical decision-making.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. It is still not clear how the language bias is subtracted from the total causal effect for counterfactual training, as shown in Figure 2 (c)&(d).
2. The connection between TIE and NDE and the objective in Eq. (7) is not clearly explained.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Data sets are publicly available. If authors provide source code and experimental details, the reviewer will be confident in the reproducibility.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Figure 2 is too busy. It would be better to separate the sub-figures so that each follows its related context, or to reorganize the figures to make them clearer and more readable.
2. See weaknesses. These two problems require further clarification.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper proposes a novel solution for the bias issue in Med-VQA. The constructed dataset is likely to be a new benchmark in the field of Med-VQA. The reported experimental results demonstrate the effectiveness of the proposed approach.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

This paper proposes a novel methodology for addressing language priors in medical visual question answering (Med-VQA) task, which consists of two parts: counterfactual training data preparation and counterfactual causal effect training. The counterfactual training data preparation aims to implicitly weaken the language bias by preparing counterfactual training samples for improving the sensitivity of clinic objects. The counterfactual causal effect training aims to explicitly reduce the linguistic bias/priors by treating the language bias/priors as the counterfactual causal effect and subtracting it from the total causal effect with counterfactual training. Additionally, they present a newly splitting bias-sensitive Med VQA dataset, Semantically-Labeled Knowledge-Enhanced under Changing Priors (SLAKE-CP) dataset through regrouping and re-splitting the train-set and test-set of SLAKE into the different prior distribution of answers.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper finds that most existing Med-VQA works neglect the cheating factors and typically resort to priors in the linguistic distribution.

This paper proposes a novel methodology that implicitly weakens the language bias by preparing counterfactual training samples that improve sensitivity to clinical objects.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

In the discussion part, what is the ratio of counterfactual examples in counterfactual training? What is the impact of different ratios on the results?

The writing needs to be improved. 1) In the “SLAKE-CP: Construction and Analysis” section, the re-splitting process is not clear enough; it is recommended to illustrate it with a picture or to rephrase it. 2) Figure 2 is not clear enough, with different sub-figures mixed together in one figure. It is recommended to draw them separately to improve clarity.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

This work will be reproducible once the authors release their dataset and source code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

The authors should carefully check for typos before resubmitting their revision. Please refer to the comments under weaknesses.

The paper would benefit from a more detailed discussion on the generalizability of the proposed approach.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This manuscript points out the language models’ bias toward the training data and provides a good solution for it. However, the experimental analysis is insufficient; a more detailed exploration (what is the ratio of counterfactual examples in training, and what is its impact?) should be done to make it more complete. The writing of this manuscript also needs polishing (e.g., the four sub-figures in Fig. 2, re-splitting in Section 3, typos, etc.).
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper
1. The paper contributes to the field of Medical Visual Question Answering (Med-VQA) by proposing a novel DeBiasing Med-VQA model with CounterFactual training (DeBCF) to comprehensively address the issue of language priors.
2. The proposed DeBCF model generates counterfactual samples and formulates language prior into counterfactual causal effects, aiming to improve the model’s sensitivity to semantic words and visual objects while reducing biases. Additionally, the paper introduces a new bias-sensitive Med-VQA dataset called Semantically-Labeled Knowledge-Enhanced under Changing Priors (SLAKE-CP), which facilitates the evaluation of debiased ability in Med-VQA models.
3. The experimental results demonstrate that the proposed DeBCF model outperforms state-of-the-art Med-VQA models in terms of accuracy and interpretability.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper addresses a crucial issue in Medical Visual Question Answering (Med-VQA), which is the reliance on superficial linguistic correlations, leading to potentially inaccurate clinical answers.
2. The proposed DeBiasing Med-VQA model with CounterFactual training (DeBCF) aims to mitigate language priors both implicitly and explicitly, offering a comprehensive solution.
3. The creation of the Semantically-Labeled Knowledge-Enhanced under Changing Priors (SLAKE-CP) dataset is a valuable contribution, as it enables the evaluation of debiased ability in Med-VQA models.
4. The experimental results demonstrate that the proposed DeBCF model outperforms existing state-of-the-art Med-VQA models in terms of accuracy and interpretability on two public datasets and the SLAKE-CP dataset.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The paper should include a comparison of the proposed DeBCF model with other debiasing techniques or methods that address the issue of language priors in the context of VQA.
2. When discussing the interpretability results, consider providing more quantitative analysis (e.g., model calibration) on the entire test set rather than just listing a few qualitative samples.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

I believe this work can be reproduced if the code is given.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. Comparison with other debiasing techniques or methods: To strengthen the paper, it is important to provide a comparative analysis of the proposed DeBCF model with other debiasing techniques or methods that address the issue of language priors in the context of VQA. This will help establish the novelty and effectiveness of your approach in comparison to existing solutions. Some suggestions: (1) Provide a brief literature review of existing debiasing techniques or methods, highlighting their strengths and weaknesses in addressing language priors in VQA. Clearly explain how the proposed DeBCF model is different from or improves upon these existing techniques. (2) Include additional experiments comparing the DeBCF model with these methods on the same datasets, using the same evaluation metrics. This will allow a fair comparison and showcase the advantages of the proposed method. Several papers focus on debiasing general visual question answering: (a) Cadene, R., Ben-younes, H., Cord, M., & Thome, N. (2019). RUBi: Reducing Unimodal Biases for Visual Question Answering. NeurIPS 2019. (b) Li, M., Liu, S., & Zhu, X. (2021). Reducing Language Bias in Visual Question Answering using Visio-Linguistic Fusion. The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).
2. Quantitative analysis for interpretability: While qualitative examples and visualizations are useful for understanding the improvement in interpretability achieved by the proposed DeBCF model, it is essential to provide more quantitative analysis to support your claims. Some suggestions: (1) Consider using model calibration as a quantitative measure to assess the interpretability of your model on the entire test set. Model calibration measures the reliability of a model’s predicted probabilities, with well-calibrated models providing accurate estimates of the true probabilities of the predicted outcomes. You can refer to this paper [Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. ICML 2017.] (2) In addition to the accuracy metric, report these interpretability metrics for the proposed DeBCF model and compare them with the existing state-of-the-art Med-VQA models to demonstrate the improvement in interpretability. You may refer to the following paper, especially as it is also about medical visual question answering. [Gong, H., Chen, G., Mao, M., Li, Z., & Li, G. VQAMix: Conditional triplet mixup for medical visual question answering. IEEE Transactions on Medical Imaging.] (3) You may also explore other interpretability metrics or methods, such as Local Interpretable Model-agnostic Explanations (LIME) or Shapley Additive Explanations (SHAP), which can provide a quantitative measure of feature importance for individual predictions.
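As a hedged illustration of the calibration measure suggested in point 2, the sketch below computes the Expected Calibration Error (ECE) and Maximum Calibration Error (MCE) following the definitions in Guo et al. (2017), using equal-width confidence bins. The function name `calibration_errors`, its arguments, and the toy inputs are assumptions made for illustration; they are not taken from the paper under review.

```python
import numpy as np


def calibration_errors(confidences, correct, n_bins=15):
    """Return (ECE, MCE) over equal-width confidence bins.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: 1 if the chosen answer was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Each bin is (lo, hi]; the first bin also takes confidence == 0.
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if not in_bin.any():
            continue
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples
        mce = max(mce, gap)
    return ece, mce


# Toy input: four predictions at 0.95 (three correct), two at 0.55 (one correct).
ece, mce = calibration_errors([0.95] * 4 + [0.55] * 2, [1, 1, 1, 0, 1, 0])
```

For this toy input the two occupied bins have gaps of 0.2 and 0.05, so the ECE is about 0.15 and the MCE about 0.2. Lower values indicate better calibration.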
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
1. Novelty: The proposed DeBCF model is a new approach to address the issue of language priors in the context of Visual Question Answering (VQA). The comparison with other debiasing techniques or methods is necessary to establish the novelty of the proposed method in this domain.
2. Interpretability: The paper demonstrates improvements in interpretability using the DeBCF model. However, the provided analysis is mostly qualitative. Including a more comprehensive quantitative analysis, such as model calibration, would significantly strengthen the paper’s claims about interpretability.
3. Comparison with existing works: A comparative analysis of the DeBCF model with existing debiasing techniques or methods in VQA is missing. By providing a literature review, clearly explaining the differences or improvements over existing techniques, and including additional experiments for a fair comparison, the paper will be able to better showcase the advantages of the proposed method.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

6
[Post rebuttal] Please justify your decision

The authors have made the comparison with other related debiasing methods and provided a quantitative analysis of the model’s interpretability. I think the current paper is much better than the previous one.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The work addresses the important research problem of shortcut learning through counterfactual training for medical visual question answering. While there are merits in the novel method and the performance improvement, the reviewers raised major concerns about: (i) the missing comparison with other closely related debiasing methods (R3); (ii) the unclear explanation of the causal effect (R1) and of the ratio of counterfactual examples (R2); and (iii) some figures and equations that are not clear (R1, R2).

Author Feedback

We highly appreciate all the valuable comments from the meta-reviewer and the reviewers, and we have tried our best to address all of them.

Replies to Meta-R: (i) Please refer to replies 5 and 6. (ii) Please refer to replies 2 and 3. (iii) Please refer to replies 1 and 4.

Replies to R1:

1. Unclear Fig. 2(c)(d): In the revision, we have redrawn Fig. 2(c)(d) to highlight the causal effect of language bias on the answer and to use a minus sign directly to represent the subtraction.

2. Unclear explanation of TIE and NDE with Eq. (7): Eq. (7) is the complete training objective, which combines the training losses on the original data and on the generated counterfactual data. The TIE reflects the reduction of language bias obtained by subtracting the NDE from the TE. At the inference stage, we choose the answer with the maximum TIE as the prediction.

Replies to R2:

3. The ratio of counterfactual samples and its impact: We generate a corresponding counterfactual sample for each original data pair and control its impact through the training-loss weight (1-\alpha) in Eq. (7). The influence of this ratio is reported in Table 5 and in Table 3 of the Suppl., which show that ratio=0.4 achieves the best performance. The comparisons imply that a proper proportion of counterfactual samples can implicitly improve the sensitivity of features and weaken the language bias, whereas increasing it further may deviate from the original semantic representations.

4. 1) Re-splitting of SLAKE-CP; 2) unclear presentation of Fig. 2: We have carefully checked for typos and polished the writing. 1) Rephrasing of re-splitting: we first assign one group to the test-set. Among the remaining groups, a group whose question type or answer differs from those of the groups in the test-set is assigned to the test-set; otherwise it is assigned to the train-set. Once the test-set reaches 1/7 of the whole set, the remaining groups are added to the train-set. 2) We have split Fig. 2 and drawn the sub-figures separately to improve clarity.

Replies to R3:

5. Comparison with other related debiasing methods: We have compared DeBCF with other debiasing models, including the recommended RUBi as well as LPF [a] and GGE [b]. (The second recommended paper could not be retrieved.) Although these methods can effectively reduce language bias, they disregard interpretable visual-linguistic information and, as a result, weaken the inference ability. In contrast, we explicitly subtract the language bias through the causal effect and generate counterfactual samples to implicitly improve sensitivity to clinical words and visual objects during inference. Results on the VQA-RAD, SLAKE and SLAKE-CP datasets (Open/Closed/Overall accuracy):

RUBi: 42.4/73.2/61.5; 75.1/77.6/75.8; 12.2/26.9/26.4
LPF: 41.7/72.1/60.9; 74.8/77.8/74.9; 13.1/29.7/30.2
GGE: 44.6/74.5/63.8; 76.4/78.7/76.6; 13.9/30.9/30.8
Ours: 58.6/80.9/71.6; 80.8/84.9/82.6; 18.6/35.4/34.2

[a] LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering. SIGIR ’21.
[b] Greedy gradient ensemble for robust visual question answering. ICCV 2021.

6. More quantitative analysis: As suggested, we adopt model calibration as a quantitative measure of interpretability. We calculate the Expected Calibration Error (ECE) and the Maximum Calibration Error (MCE), dividing all predictions on the test-set into 15 bins of the same size. The ECE/MCE results are:

MFB: 0.39/0.77
SAN: 0.36/0.75
BAN: 0.34/0.70
MEVF+SAN: 0.31/0.69
MEVF+BAN: 0.28/0.67
CLIPQCR: 0.22/0.48
CPRD+BAN: 0.20/0.39
VQAMix: 0.13/0.28
Ours: 0.10/0.21

We also conduct a quantitative analysis of individual predictions using SHAP and LIME. For example, our model assigns the words “What”, “organ” and “largest” SHAP values of 0.09, 0.15 and 0.4, respectively, and the image features with the highest SHAP values (0.0002~0.0004) outline the brain, which precisely matches the answer “Brain”. Under the same conditions, VQAMix yields word SHAP values of -0.01, 0.10 and -0.25 and image SHAP values of (-0.0001~0.0002).
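To make the rebuttal's description of TIE and Eq. (7) concrete, here is a minimal sketch under stated assumptions. It assumes per-answer scores for the total effect (TE) and the natural direct effect (NDE) are already available, and that Eq. (7) is a weighted sum with weight \alpha on the original-data loss and (1-\alpha) on the counterfactual-data loss. The function names, the score values, and the exact form of the weighted sum are hypothetical; the rebuttal does not spell them out.

```python
import numpy as np


def tie_predict(te_scores, nde_scores):
    """Return (index of the answer with maximum TIE, TIE vector).

    TIE = TE - NDE: the language-only (direct) effect is subtracted
    from the total effect before choosing the answer.
    """
    tie = np.asarray(te_scores, dtype=float) - np.asarray(nde_scores, dtype=float)
    return int(np.argmax(tie)), tie


def combined_loss(loss_original, loss_counterfactual, alpha):
    """Assumed form of the objective: alpha weights the original-data loss,
    (1 - alpha) weights the counterfactual-data loss."""
    return alpha * loss_original + (1.0 - alpha) * loss_counterfactual


# Hypothetical scores for three candidate answers.
te = [2.0, 1.5, 0.5]   # total effect: answer 0 looks best
nde = [1.8, 0.4, 0.1]  # language-only effect: answer 0 is mostly language prior
best, tie = tie_predict(te, nde)
```

In this toy case the raw TE scores favour answer 0, but most of that score comes from the language-only branch; after subtraction, answer 1 has the largest TIE and becomes the prediction.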

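The re-splitting rule rephrased in the author feedback can be sketched as follows. This is one possible reading, not the authors' implementation: a group counts as "different" if its question type or its answer has not yet appeared in the test-set, groups are visited in the given order, and the set size is measured in QA samples. The `Group` class, the `resplit` function and the example groups are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Group:
    question_type: str
    answer: str
    size: int  # number of QA samples in the group


def resplit(groups, test_fraction=1 / 7):
    """Split groups into (train, test) following the rule described above."""
    total = sum(g.size for g in groups)
    train, test = [], []
    test_size = 0
    seen_types, seen_answers = set(), set()
    for g in groups:
        # Once the test-set reaches the target share, everything goes to train.
        full = test_size >= test_fraction * total
        # The first group is always novel because nothing has been seen yet.
        novel = g.question_type not in seen_types or g.answer not in seen_answers
        if not full and novel:
            test.append(g)
            test_size += g.size
            seen_types.add(g.question_type)
            seen_answers.add(g.answer)
        else:
            train.append(g)
    return train, test


# Hypothetical groups (22 samples in total, so the test target is about 3.14).
groups = [
    Group("organ", "brain", 2),
    Group("organ", "brain", 5),
    Group("modality", "CT", 2),
    Group("organ", "lung", 10),
    Group("plane", "axial", 3),
]
train, test = resplit(groups)
```

With these groups, the first and third groups go to the test-set; the second repeats an already seen type and answer, and the last two arrive after the test-set has reached its target share, so all three go to the train-set.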
Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have thoroughly clarified all major concerns of reviewers, specifically unclear figures, equations and additional explanations about the causal effect and counterfactual examples. The rebuttal also offers additional quantitative analysis and performance comparison pointed out by the reviewers and meta-reviewer. I would like to thank the authors for their convincing responses and willingness to add them to the final submission. This is an interesting work and of interest to the MICCAI community, and thus, I suggest acceptance.

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

All reviewers agree that this paper proposes a novel method and give positive evaluations (2 weak accepts and 1 accept). Previous concerns have been successfully addressed during the rebuttal. This will be of interest to the MICCAI community. I suggest ‘Accept’.

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The added comparison and other factors in the rebuttal reduced the concerns of the original reviewers.


Debiasing Medical Visual Question Answering via Counterfactual Training