
Authors

Long Bai, Mobarakol Islam, Hongliang Ren

Abstract

The visual-question localized-answering (VQLA) system can serve as a knowledgeable assistant in surgical education. Besides providing text-based answers, the VQLA system can highlight the region of interest for better surgical scene understanding. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning new knowledge. Specifically, when DNNs learn on incremental classes or tasks, their performance on old tasks drops dramatically. Furthermore, due to medical data privacy and licensing issues, it is often difficult to access old data when updating continual learning (CL) models. Therefore, we develop a non-exemplar continual surgical VQLA framework to explore and balance the rigidity-plasticity trade-off of DNNs in a sequential learning paradigm. We revisit the distillation loss in CL tasks and propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist) to preserve old knowledge. The weight aligning (WA) technique is also integrated to adjust the weight bias between old and new tasks. We further establish a CL framework on three public surgical datasets, in surgical settings that contain overlapping classes between old and new surgical VQLA tasks. With extensive experiments, we demonstrate that our proposed method reconciles learning and forgetting on continual surgical VQLA better than conventional CL methods. Our code is publicly accessible at github.com/longbai1006/CS-VQLA.
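
For readers unfamiliar with the weight aligning (WA) technique mentioned in the abstract, the sketch below shows the generic idea (rescaling the classifier weights of new classes by the ratio of mean weight norms between old and new classes); this is background on the general technique, not the authors' exact implementation, and the function name is illustrative.

    import torch

    @torch.no_grad()
    def weight_align(classifier_weight: torch.Tensor, num_old_classes: int) -> torch.Tensor:
        # classifier_weight: (num_classes, feat_dim) weight matrix of the final linear layer.
        # Rescale the new-class rows so their mean L2 norm matches that of the old-class rows,
        # reducing the prediction bias toward newly learned classes.
        old_norms = classifier_weight[:num_old_classes].norm(dim=1)
        new_norms = classifier_weight[num_old_classes:].norm(dim=1)
        gamma = old_norms.mean() / new_norms.mean()
        aligned = classifier_weight.clone()
        aligned[num_old_classes:] *= gamma
        return aligned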

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_7

SharedIt: https://rdcu.be/dnwOH

Link to the code repository

https://github.com/longbai1006/CS-VQLA

Link to the dataset(s)

https://endovissub2018-roboticscenesegmentation.grand-challenge.org/home/

https://endovissub2017-roboticinstrumentsegmentation.grand-challenge.org/

https://ai.stanford.edu/~syyeung/tooldetection.html


Reviews

Review #1

  • Please describe the contribution of the paper

    The article presents a novel task, surgical visual-question localized-answering (VQLA), and enhances the model's ability to transfer to new classes while mitigating catastrophic forgetting through distillation training. The experimental section demonstrates the competitiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The overall structure of the article is clear and the experimental part is relatively complete.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Task: This paper newly defines a surgical visual-question localized-answering (VQLA) task, but both medical VQA and CL-VQA are common problems, and the task proposed in this paper seems to be simpler than CL-VQA while containing the same category information. Therefore, I have doubts about the significance and innovation of defining a new task in this paper.

    2) Innovation: Plasticity-rigidity is a common perspective in incremental learning. Although there is an overlap between old and new classes in this task, the distillation methods do not appear to be purposefully designed for it. If the old and new classes did not overlap at all, the transfer ability of the model would be more demanding. What distillation difficulties are posed by the overlapping classes? The KL divergence in the distillation loss in Section 2.2 is also calculated according to the traditional split of old and new classes. Will the overlapping classes prevent the model from learning the new classes?

    3) Experiments: Most of the works in the comparison experiments address incremental learning on images, which differs significantly from the medical video VQA task in this paper. First, are the experimental results of these works reproducible on the dataset of this paper? Second, can the authors provide a comparison with incremental learning methods from the VQA domain?

    4) Details: Several VQA examples for this paper are provided in the supplementary material, but the questions and answers appear to differ significantly from natural images. Is it therefore appropriate to use ImageNet-pretrained image encoders in this paper? Has the feature extractor been fine-tuned with medical data? The comparison with incremental learning methods for natural images in the main experiment does not seem reasonable.

    5) Writing: The writing logic of the paper is not clear. The abstract and introduction left me in doubt about the significance and difficulty of the new task presented in the paper. The description of the method section is almost identical to conventional distillation work, and it is difficult to see the specificity of the medical problem.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good. The authors provided the code in the supplementary material. I would like to know whether the results of the other methods in the comparison experiments are reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    See “weaknesses” above. My questions mainly focus on the following points. First: what is the significance and difficulty of the new task proposed in the paper? Compared to traditional incremental learning, it seems to be simpler. Second: the approach proposed in the paper is almost the same as common incremental learning ideas; how is it adapted to the new task? Third: the comparison experiments do not include medical-task approaches. Fourth: the writing of the paper needs to be improved.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My main concerns are the limited innovation of the article and the lack of analysis and improvements specific to the medical task. See “weaknesses” for details.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents the Continual Surgical Visual-Question Localized-Answering (CS-VQLA) framework, which aims to address the issue of catastrophic forgetting in clinical applications by providing a non-exemplar continual learning (CL) method for surgical VQLA tasks. To achieve this, the authors propose two distillation methods, namely rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist), and integrate them with the weight aligning (WA) technique. Through these approaches, the CS-VQLA framework can effectively balance the rigidity-plasticity trade-off of deep neural networks (DNNs) and demonstrate improved performance in VQLA tasks. The proposed framework shows promise for real-world surgical education and scene understanding, given its ability to enhance performance in surgical VQLA tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The framework integrates two distillation methods, rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist), along with a weight aligning (WA) technique to enhance the baseline framework’s performance. The proposed framework is effective in mitigating forgetting in continual learning.

    The authors presented their findings clearly, and their methodology was easy to follow, making the paper accessible to readers.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experimental settings are not well designed. For the datasets, the authors construct the continual procedure with three datasets, namely EndoVis18, EndoVis17, and M2CAI. But the authors do not provide sufficient justification for the chosen order; why not choose to evaluate the CL performance on the public one? As the authors claim to have realized a general framework for the VQLA task, there is a lack of evidence to support this assertion, since they only construct a single procedure (t0->t1->t2) in their experiments.

    The rigidity-plasticity trade-off has been previously explored in the Continual Hyperparameter Framework [1], yet the authors neither address such approaches nor demonstrate the novelty of their RP-Dist approach in comparison. Therefore, the proposed framework for the VQLA task still requires further research and analysis to justify its effectiveness and uniqueness.

    [1] M. De Lange, et al., “A Continual Learning Survey: Defying Forgetting in Classification Tasks”, arXiv:1909.08383v4, 2021.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work could be reproduced with the open-sourced code and the authors' self-annotated data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    I recommend incorporating more comprehensive discussions on the novelty of the proposed framework and its contributions to the existing literature. Additionally, further discussion of previous works in this area would help contextualize the research and demonstrate the framework's contributions and insights.

    To support the generalization of the proposed framework, I suggest conducting additional experiments using different datasets and medical tasks, or using the same datasets in different orders. These experiments would provide valuable insights into the framework's applicability across various domains and highlight its potential for real-world implementation.

    I suggest that the authors discuss future work in more depth.

    In the second part of the experiments, the authors need more comparative experiments to conclude that each component is indispensable. For example, compare models with two components against models with all components.

    In the analysis of the experimental results, the authors should explain why the proposed framework performs worse than other SOTA methods in some settings.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this paper presents the Continual Surgical Visual-Question Localized-Answering (CS-VQLA) framework, which aims to address the issue of catastrophic forgetting in clinical applications.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The contribution of this paper is the proposal of a general continual learning framework for surgical visual question-answering (VQA) tasks. The proposed framework addresses the challenges of catastrophic forgetting, class increment, and domain shift, which are common in surgical education scenarios, and improves the performance of the VQA system. The framework includes several technical innovations, such as the RP-Dist and SH-Dist techniques, and is evaluated on multiple public datasets. The results demonstrate superior performance compared to several state-of-the-art methods, indicating the potential for this framework to be applied in surgical education and training across time and institutions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces a novel framework for continual learning on surgical visual question-answering tasks, which can address limitations and challenges of existing approaches. The framework includes technical innovations such as RP-Dist and SH-Dist techniques, which specifically target catastrophic forgetting and improve performance on surgical VQA tasks. The proposed framework is extensively evaluated on multiple public datasets and demonstrates superior performance compared to several state-of-the-art methods. The evaluation includes both answering and localization performance metrics, providing a more comprehensive assessment of the proposed approach. The proposed framework has potential practical applications in surgical education and training, and the paper discusses several potential future directions for continued research and development in this area.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Dataset selection: The paper only evaluates the proposed framework on three public surgical datasets. It is unclear how the method would perform on other datasets or in different surgical settings.

    Lack of real-world testing: The experiments are conducted in a controlled environment and it is unclear how the proposed framework would perform in real-world surgical scenarios.

    Complexity: The proposed framework involves multiple techniques and components, which may make it more difficult to implement and integrate into existing surgical education systems.

    Resource requirements: The proposed framework requires pre-training on large-scale datasets and the use of powerful GPUs for training, which may be a limiting factor for smaller institutions or those with limited resources.

    Evaluation metrics: While the evaluation includes both answering and localization performance metrics, it is unclear if these metrics fully capture the effectiveness of the proposed framework in real-world surgical scenarios.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The proposed framework involves multiple techniques and components, which may make it more difficult to implement and integrate into existing surgical education systems.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The pseudo-label set is manually constructed and biased towards the pseudo-answer label, which may not always reflect the true underlying distribution of the data. This could lead to overfitting to the specific distribution of the old data and may not generalize well to new data or tasks.
    • The choice of temperatures for the soft probabilities may not be optimal for all datasets or tasks. The authors used empirical values of 25 and 20 for T_op and T_on, respectively, but these values may not be optimal for other datasets or settings. Tuning the temperature parameter can be challenging and may require trial and error or more sophisticated optimization techniques (a sketch of temperature-scaled distillation follows this list).
    • Knowledge distillation can be sensitive to hyperparameters such as the learning rate, batch size, and regularization, which can affect the stability and convergence of the training process. Careful tuning and experimentation may be needed to achieve optimal results.
    • This approach assumes that the new and old data are drawn from the same distribution or at least have some overlap in the feature space. In practice, this may not always be the case, especially when dealing with highly heterogeneous or unstructured data. In such cases, more sophisticated transfer learning or domain adaptation techniques may be needed to effectively transfer knowledge between tasks.
    • The proposed method uses multiple loss functions and hyperparameters that need to be tuned and optimized to achieve optimal performance. This can be a time-consuming and challenging process, especially when dealing with large and complex models. Additionally, the effectiveness and generalization of the approach may depend on the specific choice of loss functions and hyperparameters, which may require careful experimentation and evaluation.
    • The authors have used robotic surgery datasets for their experiments, which may have specific characteristics and challenges that are not present in other domains. Evaluating the effectiveness of the approach on other types of data or tasks may require additional experimentation and validation.

    • The authors have split the datasets into training and test sets for each time period to avoid information leakage, which is a common practice in ML. However, it is important to ensure that the test sets are representative and diverse enough to properly evaluate the performance of the model. Careful consideration of the dataset split and evaluation metrics is crucial to draw valid and meaningful conclusions from the experiments.
    • Could you elaborate on how the specific hyperparameters and optimization settings used in the experiments may affect the performance and generalization of the proposed approach?
    • How did you decide on the choice of hyperparameters and optimization settings used in the experiments, and did you perform any sensitivity analysis to investigate their impact on the results?
    • Have you considered any alternative evaluation metrics or dataset splits to ensure the validity and generalizability of the experimental results?
    • How do you plan to address potential limitations or shortcomings in the experimental setup and ensure that the proposed approach can be effectively applied to other domains or scenarios?
    • Have you conducted any experiments or analysis to investigate the effectiveness of the proposed approach on other types of clinical or medical tasks beyond surgical VQLA? If not, do you plan to explore the applicability of this framework to other domains?
    • How do you envision the potential deployment of this framework in real-world surgical education settings? What are some potential challenges or considerations for implementing this approach in practice?
    • Could you elaborate on some of the potential future directions or extensions for this work, such as incorporating other training systems or assessment methods into a comprehensive virtual teaching system?
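
    For reference on the temperature discussion above: the temperature-scaled soft targets follow the standard knowledge-distillation formulation, sketched below in PyTorch. The tensor shapes and the use of a single shared temperature are illustrative assumptions; T_op and T_on from the paper are treated here as plain constants.

        import torch
        import torch.nn.functional as F

        def soft_distillation_loss(student_logits, teacher_logits, temperature=20.0):
            # Standard temperature-scaled KL distillation (Hinton et al.).
            # student_logits, teacher_logits: (batch, num_old_classes) tensors.
            # The T**2 factor keeps gradient magnitudes comparable across temperatures.
            log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
            p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
            return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

        # Illustrative usage with random logits; 25.0 mirrors one of the empirical
        # temperature values mentioned by the reviewer, not a verified setting.
        loss = soft_distillation_loss(torch.randn(8, 18), torch.randn(8, 18), temperature=25.0)
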
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Based on the strengths and weakness listed above.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors present their work on a general continual learning framework for surgical visual question-answering (VQA) tasks, tackling catastrophic forgetting, class increment, and domain shift. The framework includes several technical innovations, such as the RP-Dist and SH-Dist techniques, and is evaluated on multiple public datasets. The results demonstrate superior performance compared to several state-of-the-art methods, indicating the potential for this framework to be applied in surgical education and training across time and institutions.

    The strengths of the work include: 1) technical innovations such as the RP-Dist and SH-Dist techniques, which specifically target catastrophic forgetting and improve performance on surgical VQA tasks, though reviewers did question true novelty relative to other works in fields outside of surgery, as noted below; 2) evaluation on multiple public datasets with comparison to SOTA approaches.

    Weaknesses of the work that may be outweighed by the strengths but merit consideration and/or clarification by the authors: 1) The evaluation metrics help assess the model's answering and localization performance, but do they fully capture the effectiveness of the proposed framework in real-world surgical scenarios, as implied by the authors? Have the authors considered other metrics, as in https://arxiv.org/abs/2206.01653? 2) Given the sensitivity of their approach to hyperparameter tuning, reviewers have asked for clarification on how the authors arrived at the choice of hyperparameters and optimization settings used in the experiments, and whether any sensitivity analysis was performed to investigate their impact on the results. 3) How do the authors square their work against other related works, such as those on the rigidity-plasticity trade-off previously explored in the Continual Hyperparameter Framework by M. De Lange et al., “A Continual Learning Survey: Defying Forgetting in Classification Tasks”, arXiv:1909.08383v4, 2021? In the context of such work, how do the authors frame novelty?




Author Feedback

Thank the reviewers (R) for their critical assessment and insightful suggestions. We also appreciate the meta-reviewer (MR) for granting us the opportunity to clarify some major critiques as follows:

Justification of the new task and CL setup (MR, R1, R2, R3): In the medical/surgical domain, overlapping classes are a common issue in CL. We know and agree with the common theory (the rigidity-plasticity trade-off) and will also include this paper in our discussion. However, we propose a new solution for handling overlapping and non-overlapping classes, which is an issue specific to the surgical domain. Firstly, the model will under-emphasize new classes and be strongly biased toward overlapping classes; if we naively follow the distillation of existing CL models, the overlapping classes will dominate the model's predictions. Secondly, catastrophic forgetting will be severe for the old non-overlapping classes, which are again crowded out of the predictions by the overlapping classes. Therefore, we construct our experimental setup with 3 different surgical datasets from 3 different centers, and we keep the CL setting within the surgical domain rather than using highly unstructured data.

Justification of the results and evaluation (MR, R2, R3): In data splitting, besides splitting by different videos, we also ensure that the test set covers all answer classes for evaluation.

We further add the balanced accuracy from the suggested reference as an additional evaluation metric. Results from the top methods in Step 0-1 are as follows:

    Method | Old N/O | Overlap | New N/O
    LwF    | 0.00    | 20.13   | 31.74
    iCaRL  | 0.00    | 30.91   | 42.76
    Ours   | 0.00    | 35.83   | 45.06
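
Balanced accuracy is assumed here to be the macro-averaged per-class recall; the minimal sketch below shows how such a score can be computed with scikit-learn (the labels are purely illustrative, not taken from the paper).

    from sklearn.metrics import balanced_accuracy_score

    # Balanced accuracy averages recall over classes, so rare answer classes
    # are not drowned out by frequent ones.
    y_true = [0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0, 2]
    print(balanced_accuracy_score(y_true, y_pred))  # ~0.72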

Hyperparameter tuning (MR, R3): We have carefully tuned our hyperparameters and the ablation study on temperature is in the supplementary. We added our tuning progress on the loss combination, and some results in Step 0-1 are as follows:

    Setting         | Old N/O (Acc, mIoU) | Overlap (Acc, mIoU) | New N/O (Acc, mIoU)
    α=1, β=1, γ=10  | 1.53, 59.94         | 56.20, 73.66        | 78.26, 76.62
    α=1, β=5, γ=5   | 0.00, 60.98         | 56.05, 74.40        | 73.91, 79.19
    α=1, β=2.5, γ=5 | 0.00, 60.83         | 56.46, 74.35        | 78.26, 78.22
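
As an aid to reading the table above: α, β, and γ weight the terms of the combined training objective. Assuming (this mapping is an inference, not stated explicitly in the rebuttal) that they scale the base task loss and the two proposed distillation losses respectively, the objective would take a form such as:

    \mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}_{\mathrm{task}} + \beta\,\mathcal{L}_{\mathrm{RP\text{-}Dist}} + \gamma\,\mathcal{L}_{\mathrm{SH\text{-}Dist}}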

Reproducibility and more comparisons (R1): Existing models can be easily reproduced in our VQLA setup, as our question answering is a classification-based task. We only need to replace the backbone with a multimodal one (e.g., VisualBERT) and add a 3-layer MLP as the detector in parallel with the classifier. We further compare our method with 2 CL VQA papers and show the results in Step 0-1:

    Method                                  | Old N/O (Acc, mIoU) | Overlap (Acc, mIoU) | New N/O (Acc, mIoU)
    CLVQA (Lei et al., AAAI 2023)           | 0.00, 59.87         | 51.83, 72.98        | 65.22, 78.36
    CLiMB (Srinivasan et al., NeurIPS 2022) | 0.00, 60.16         | 52.88, 72.99        | 69.57, 77.37
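
A minimal sketch of the baseline adaptation described above (a multimodal backbone with an answer-classification head and a parallel 3-layer MLP detection head) is given below. The module names, hidden size, checkpoint name, and box parameterization are illustrative assumptions, not the authors' exact code.

    import torch
    import torch.nn as nn
    from transformers import VisualBertModel

    class VQLABaseline(nn.Module):
        # Illustrative: an answer classifier and a 3-layer MLP box regressor share a
        # multimodal VisualBERT backbone, mirroring the adaptation described above.
        def __init__(self, num_answer_classes, hidden_dim=768):
            super().__init__()
            self.backbone = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
            self.classifier = nn.Linear(hidden_dim, num_answer_classes)
            self.detector = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 4),  # normalized box, e.g. (cx, cy, w, h)
            )

        def forward(self, input_ids, attention_mask, visual_embeds, visual_attention_mask):
            out = self.backbone(input_ids=input_ids,
                                attention_mask=attention_mask,
                                visual_embeds=visual_embeds,
                                visual_attention_mask=visual_attention_mask)
            pooled = out.pooler_output  # (batch, hidden_dim)
            return self.classifier(pooled), self.detector(pooled).sigmoid()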

More ablations (R2): Our ablation study in Tables 3 & 4 removes one component at a time from our final solution, so each variant is the combination of the baseline and the 2 remaining components. We will further provide results for the baseline with 1 component in the final manuscript.

Pretrained ResNet18 (R1, R3):

  1. Our main focus is to design a CL model from this rarely explored perspective. The feature extractor is easily accessible from the PyTorch library and does not affect the results much.
  2. The ImageNet pre-trained model can also extract features well from RGB images in the medical domain (Yueming Jin et al., TMI 2021; Lalithkumar Seenivasan et al., MICCAI 2022).

Future work discussion (R2, R3): Firstly, our solution can conduct CL/class-incremental learning on any question set in surgical applications to address the problem of overlapping/non-overlapping classes. This solution can also be applied when adapting a vision-language foundation model to the surgical domain. From the VQLA point of view, this system can serve as an effective auxiliary surgical training tool that gives localized answers for better surgical scene understanding. We will provide a more in-depth discussion in the final manuscript.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the authors have provided reasonable responses to the perceived weaknesses of the paper. Given the reviewers' comments and the strengths they highlight, particularly Reviewers 2 and 3, I believe there is sufficient novelty, as noted in my initial meta-review, and the authors' rebuttal addresses some of the concerns regarding the weaknesses (within reason for a rebuttal, without requiring extensive additional experimentation), so I lean toward accept.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors present a continual learning framework for surgical VQLA tasks, while addressing catastrophic forgetting through distillation training. Extensive validation experiments are done on three public datasets, demonstrating promising performance compared to several SOTA methods. The reviewers highlight the effectiveness of the proposed approach, the comprehensive validation experiments, and promising potential for clinical applications as the key strengths of the work.

    The main concerns, including the significance and difficulty of the new task (compared to CL-VQA), novelty of the proposed approach and further discussion on previous work, and concerns regarding the comparison baselines and evaluation metrics have been addressed by the rebuttal to a large extent.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper presents a work in the field of visual Q&A and continual learning using endoscopic frame annotations. The strengths of the paper have been nicely summarized by the first meta-reviewer, which I completely support: methodological innovations, extensive comparison to SOTA, and evaluation on public databases. I found the authors' response to the criticism appropriate. They have provided even more quantitative results (e.g., with the suggested metric 'balanced accuracy'), which support their claims.


