
Authors

Chen Lin, Shuai Zheng, Zhizhe Liu, Youru Li, Zhenfeng Zhu, Yao Zhao

Abstract

The robotic surgical report reflects the operations performed during surgery and informs subsequent treatment; it is therefore especially important to generate accurate surgical reports. Given the numerous interactions between instruments and tissue in the surgical scene, we propose a Scene Graph-guided Transformer (SGT) to address the problem of surgical report generation. The model builds on the transformer architecture to understand the complex interactions between tissue and instruments from both global and local perspectives. On the one hand, we propose a relation-driven attention to facilitate a comprehensive description of the interactions in the generated report, via sampling of the numerous interactive relationships to form a diverse and representative augmented memory. On the other hand, to characterize the specific interactions in each surgical image, a simple yet ingenious approach is proposed for homogenizing the input heterogeneous scene graph, which plays an effective role in modeling the local interactions by injecting the graph-induced attention into the encoder. A dataset from clinical nephrectomy is utilized for performance evaluation, and the experimental results show that our SGT model can significantly improve the quality of the generated surgical report, far exceeding other state-of-the-art methods. The code is publicly available at: https://github.com/ccccchenllll/SGT_master.



Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_48

SharedIt: https://rdcu.be/cVRXn

Link to the code repository

https://github.com/ccccchenllll/SGT_master

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes to leverage scene graphs and Transformers to generate surgical reports, where a DPP is used to obtain prototypes for encoding global relations, and a homogeneous graph is constructed to encode local relations. Extensive experiments on EndoVis18 are performed to validate the effectiveness of the method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A novel method that leverages the scene graph to generate surgical reports
    • Promising results achieved with a thorough ablation study
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some clarifications should be made; see the detailed comments
    • The evaluated dataset is quite small; therefore, cross-validation should be conducted
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility is good, as the code is already released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • How is the interaction representation X^r defined, i.e., how is it extracted from the raw image?
    • Why is only X^r used to calculate the relation memory M, instead of also including X^v?
    • A more intuitive explanation is needed for why extracting the ‘interactions’ can change the graph from a heterogeneous one to a homogeneous one. Meanwhile, why can graph-induced attention capture the ‘local’ attention?
    • If the scene graph is first generated by [9], I wonder whether it requires extra annotation information. Is the ‘interaction’ annotated by experts, as the original dataset does not contain such information?
    • For the dataset, EndoVis18 actually shows limited variety within each sequence; therefore, 3 sequences for testing are relatively insufficient. Cross-validation should be conducted when the dataset is small.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting method with promising performance achieved

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This work proposes a Transformer-based architecture guided by scene graphs to approach the surgical report generation task. The paper uses scene graphs representing visual objects and relationships, encoded using a Transformer encoder. In the attention layer, the key and value are expanded with a memory sampled using a k-Determinantal Point Process [12, 15]. This strategy allows the use of both global and local attention. Finally, it uses a meshed decoder [5] to generate the report from the resulting encoder representation. The paper presents results on one benchmark dataset: EndoVis2018.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The task of surgical report generation is relevant to the medical image analysis community.
    • The method introduces technical novelties that improve the empirical results of the task.
    • This work significantly outperforms the state-of-the-art in this task.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • In general, the clarity and organization of the paper could be improved. The method section could describe the model from input to output; this would make it clearer for the audience. Additionally, the mathematical notation should always support the text.
    • The ablation study does not allow an assessment of the model’s generalization capacity. Optimizing the architecture over the test set might result in overfitting to the benchmark dataset. The ablation study should be performed on the validation set.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    According to the reproducibility checklist, the source code and pretrained models will be made publicly available, which is essential to guarantee the reproducibility of the results. Additionally, the method was developed using a public benchmark dataset for surgical report generation, which promotes research in the area.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • It is recommended to use self-contained captions for tables and figures to convey the information more clearly.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite the flaws in clarity and organization, the task of surgical report generation is relevant to the medical image analysis community. Additionally, the paper’s main contributions correspond to the technical novelty of the model, supported by empirical results that outperform the state-of-the-art. This work could be improved with some modifications, but its contributions support its acceptance.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel surgical report generation model via a Scene Graph-guided Transformer (SGT).

    1. The scene graph can extract the relational graph between tissues and instruments, which can accurately guide the report generation process.
    2. A novel relation-memory-augmented attention is introduced to better model the interaction between input videos and generated reports.
    3. A graph neural network is adopted to better model the relations between tissues and instruments.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed approach is novel: it first generates a scene graph for each frame and then uses the scene graph to guide the report generation. To address the redundancy across the entire video, a determinantal point process (DPP) is adopted to sample the most important frames. Besides, a graph neural network is used to encode the scene graphs.
    2. The experimental results on the MICCAI 2018 challenge show that the proposed approach significantly outperforms the baselines.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some components are not explained clearly. For example, in Equation (1), what is the matrix Z? How is Z obtained?
    2. There are some typos. (1) On page 3, Section 2.2, line 3, either “the” or “a” should be used. (2) On page 4, line 5, what does p mean in \mathcal{E}_{he}^{p}?
    3. There are some missing references. This paper mentions several recent chest X-ray report generation papers as related work. However, please also consider citing the following two earliest works on chest X-ray report generation.

    [1] Jing, Baoyu, Pengtao Xie, and Eric Xing. “On the Automatic Generation of Medical Imaging Reports.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2577-2586. 2018.
    [2] Wang, Xiaosong, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Summers. “TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9049-9058. 2018.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility looks good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In general, the proposed approach is novel and interesting, and the experimental results demonstrate the effectiveness of the proposed method. However, there are some unexplained parts and typos (as listed in the weaknesses).

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel and interesting, and the experimental results demonstrate its effectiveness. The strengths outweigh the weaknesses listed above.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The reviewers agree that this is an interesting work of sufficient novelty which outperforms the state-of-the-art. R1 and R2 raise issues about the size of the dataset used for performance evaluation and the ablation study, respectively. Clarifications requested by the reviewers should be addressed in the revised paper.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We thank all reviewers for their careful comments. Due to space constraints, we focus only on responses to the main comments.

Response to R1:

Q1.1 How to define … A1.1 We directly used the datasets provided in [25], where the interaction representation X^r was extracted by the method in [9].

Q1.2 Why only use X^r … A1.2 Unlike X^v, which denotes the representations of visual objects, X^r is the set of all representations of the interactive relationships between visual objects. In the caption generation task, self-attention based on X^v alone will inevitably fail to model the a priori knowledge of the relationships between visual objects. Therefore, we introduce the relation memory M to strengthen the role of interactive relationships in the encoder. The most straightforward way is to use X^r itself as M. However, due to the computational complexity and the over-smoothing of the learned attention, k-DPP is applied to sample X^r and obtain a diverse and representative prototype subset of interaction representations to serve as M. In fact, if sampled from X^r ∪ X^v, M may contain information about visual objects that is not useful for modeling the a priori knowledge of the relationships.
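The diversity-seeking sampling described in A1.2 can be illustrated with a simple greedy routine that picks the subset of relation vectors maximizing the determinant of their similarity kernel, a common MAP-style approximation of k-DPP sampling. This is a minimal sketch, not the authors' implementation; `greedy_dpp_subset` and the toy data are our own assumptions.

```python
import numpy as np

def greedy_dpp_subset(X, k):
    """Greedily pick k rows of X maximizing the determinant of the
    similarity kernel L = X X^T restricted to the chosen subset
    (a MAP-style approximation of k-DPP sampling)."""
    L = X @ X.T
    selected, remaining = [], list(range(X.shape[0]))
    for _ in range(k):
        best, best_det = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            # Determinant grows with both magnitude and mutual diversity.
            det = np.linalg.det(L[np.ix_(idx, idx)])
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because near-duplicate vectors make the restricted kernel nearly singular, the routine naturally avoids redundant relations: given two identical rows, at most one is ever chosen.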
Q1.3 Why extract ‘interaction’ … why graph-induced attention … A1.3 As explained in Sec. 2.3, through the homogenization of the heterogeneous graph, the interactions extracted from the heterogeneous graph G_he become nodes of the newly constructed homogeneous graph G_ho, together with the visual objects. Hence, unlike G_he, whose edges consist of multiple types of interactive relationships, the edges of G_ho contain only binary 0-1 links. This means that traditional graph methods can be effectively applied to G_ho for other tasks, e.g., the attention obtained through G_ho in our work. Compared to the relation-driven global attention that is applied indiscriminately to all input surgical images, graph-induced attention is more capable of capturing the ‘local’ attention, since it effectively exploits the information of the specific scene graph associated with each input surgical image. Here, ‘local’ attention means that it is completely different for each specific input surgical image, i.e., the attention received in this way is unique, or local.

Q1.4 Is the ‘interaction’ annotated by the experts … A1.4 Yes, the scene graph G_he is first generated by [9]. In their work [9][25], the representations of the interaction between the instruments and the tissue were annotated by clinical experts with the help of the da Vinci Xi robotic system.

Q1.5 Experiments on cross-validation. A1.5 As suggested, we further conducted a 5-fold cross-validation experiment on a random division of the original dataset to alleviate the over-fitting caused by the small testing dataset. The obtained results (BLEU-1: 0.7566, CIDEr: 5.1139) are slightly lower than the original ones, but the other baseline methods also decrease. On the whole, our method is still superior to the others, and our CIDEr score is about 105% higher than that of M2T.
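The homogenization described in A1.3 can be sketched in a few lines: each typed interaction of the heterogeneous graph is promoted to a node of its own, and the resulting homogeneous graph keeps only plain 0-1 adjacency. This is an illustrative sketch under our own naming (`homogenize`, the `rel:` node prefix, and the toy triplet are assumptions, not the paper's code).

```python
def homogenize(objects, interactions):
    """Promote each typed interaction (subject, relation, object) of a
    heterogeneous scene graph to a node of its own, so the resulting
    homogeneous graph has a single node type and binary 0-1 edges."""
    # Interaction nodes get unique names so repeated relation types stay distinct.
    rel_nodes = [f"rel:{r}#{i}" for i, (_, r, _) in enumerate(interactions)]
    nodes = list(objects) + rel_nodes
    index = {name: j for j, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[0] * n for _ in range(n)]
    for i, (subj, r, obj) in enumerate(interactions):
        rel = index[f"rel:{r}#{i}"]
        for endpoint in (subj, obj):
            j = index[endpoint]
            adj[rel][j] = adj[j][rel] = 1  # plain 0-1 link, no edge types
    return nodes, adj
```

For example, the triplet (forceps, grasping, kidney) yields three nodes, with the grasping node linked to both objects by untyped edges, so standard graph attention can be applied directly to the resulting adjacency.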
Response to R2:

Q2.1 The clarity and organization … A2.1 We will improve our manuscript to make it clearer in the camera-ready version.

Q2.2 Ablation study … A2.2 We divided the dataset into training and testing sets and used only the training set, which is disjoint from the testing set, to optimize the model. To avoid the overfitting you mentioned, we further performed the ablation study with 5-fold cross-validation. The experimental results are slightly lower than the original ones, but they still demonstrate the effectiveness of M and Attn_g.

Response to R3:

Q3.1 Some components … A3.1 The matrix Z in Eq. (1) denotes the metric matrix of X^r, which is mentioned on page 4, line 15.

Q3.2 Typos. A3.2 Thanks. E_he^p denotes the set of edges of the scene graph of the p-th image among the N collected images.

Q3.3 References … A3.3 Thanks. We will add these two references to the camera-ready paper.


