
Authors

Rui Wang, Sophokles Ktistakis, Siwei Zhang, Mirko Meboldt, Quentin Lohmeyer

Abstract

The surgical usage of Mixed Reality (MR) has received growing attention in areas such as surgical navigation systems, skill assessment, and robot-assisted surgeries. For such applications, pose estimation for hands and surgical instruments from an egocentric perspective is a fundamental task and has been studied extensively in the computer vision field in recent years. However, the development of this field has been impeded by a lack of datasets, especially in the surgical field, where bloody gloves and reflective metallic tools make it hard to obtain 3D pose annotations for hands and objects using conventional methods. To address this issue, we propose POV-Surgery, a large-scale, synthetic, egocentric dataset focusing on pose estimation for hands with different surgical gloves and three orthopedic surgical instruments, namely scalpel, friem, and diskplacer. Our dataset consists of 53 sequences and 88,329 frames, featuring high-resolution RGB-D video streams with activity annotations, accurate 3D and 2D annotations for hand-object pose, and 2D hand-object segmentation masks. We fine-tune current SOTA methods on POV-Surgery and further demonstrate their generalizability to real-life cases with surgical gloves and tools through extensive evaluations. The code and the dataset are publicly available at batfacewayne.github.io/POV_Surgery_io/.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_42

SharedIt: https://rdcu.be/dnwPn

Link to the code repository

https://batfacewayne.github.io/POV_Surgery_io/

Link to the dataset(s)

https://batfacewayne.github.io/POV_Surgery_io/


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new dataset for egocentric hand and tool pose estimation during surgical activities. The paper argues for the importance of large datasets in this domain, and the authors propose a new pipeline to generate synthetic data. The synthetic data is evaluated by fine-tuning state-of-the-art methods on it. Results show the strong impact of this fine-tuning, which serves as a quality measure for the synthetic dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed pipeline is very interesting; the optimization steps, such as the contact loss, are a logical choice.
    • The paper is very well written and very easy to follow
    • Visualizations are very clear, in a vectorized format, and quite easy to understand without referring to the text.
    • Results seem convincing enough to show the high quality of the dataset
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I would have preferred a summarized detailed view on the state of the art approach or/and the GrabNet used in the pipeline.
    • In figure 1.a, unless there is a mistake on my part, it should be Kinect not Mocap, because Mocap is the sensor suit.
    • The paper does not mention the values of the loss weights (alpha, beta, gamma) used in the experiments (unless I am mistaken and they were already mentioned).
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    All good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • When defining an abbreviation, it’s best to use capital letters; for instance, in the introduction, ‘mixed reality (MR)’ should be ‘Mixed Reality (MR)’.
    • More than once in the paper I noticed sentences starting with numbers; it’s better to avoid that.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very clear and well presented and has a high contribution to the community.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a pipeline to generate a synthetic, egocentric dataset of diverse RGB-D sequences of the manipulation of three surgical tools (within an orthopaedic surgical setup), for which 3D and 2D annotations of hand-object poses and 2D segmentation masks can be extracted. The results from three different state-of-the-art methods suggest more accurate hand pose estimation as well as better generalizability to unseen datasets after fine-tuning on the generated dataset POV-Surgery.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper presents a comprehensive and effective pipeline for generation of synthetic sequences of surgical tool manipulation, taking into account the egocentric view of the surgeon, variations to the hand representation as well as fluid and realistic body motions and realistic grasp/body poses.
    • The suggested solution shows promising impact in dealing with lack of data and annotation in surgical workflow analysis problems.
    • The authors clearly explain and motivate the problem and propose a detailed set of well-presented and mathematically justified methods for developing a pipeline to resolve that. They support their methods further with a comprehensive set of experiments and results.
    • The qualitative and quantitative results on synthetic data suggest significant improvements when fine-tuning available SOTA on the generated dataset.
    • Qualitative results on real-life data indicate better generalizability of models fine-tuned on the synthetic dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors refer to temporally realistic hand-object manipulation synthesis, but no data, evaluation, or analysis supports this claim. Without observing the generated sequences, or without a metric for the temporal smoothness of these sequences or poses (especially given how challenging it is to generate realistic video sequences), it is not possible to judge whether the proposed dataset yields temporally realistic sequences.
    • Although fine-tuned models seem to generalize much better on real-life samples, there is no ablation study on the quantitative improvement of fine-tuned models on real-life data.
    • The discussion of the qualitative and quantitative results could be further elaborated and improved.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mentioned that the code and dataset will be publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The paper is well presented and structured. Two major aspects need to be addressed:
      1. Temporal consistency of the generated videos, quantitatively and/or qualitatively (please check the weaknesses for more details).
      2. Quantitative evaluation of the real-life experiments. Although the qualitative results show promise, an ablation study similar to Table 1 would give a better understanding of the quantitative impact of fine-tuning on POV-Surgery in real-life scenarios, especially since Fig. 5(b) is also not well explained.
    • In Section 2.4, various glove textures, blood patterns, and a synthetic room design are highlighted as scene/texture variations. However, other sources of variation, such as occlusion, blur/noise, illumination, tool, skin, and color, are not mentioned.
    • Figure 5.b is not well analyzed. Also the PCP metric is not explained.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper shows interesting findings in a) the generation of customized synthetic data and b) the use of this data to obtain better and more generalizable models. There are, however, two main shortcomings: 1) lack of justification for how representative the synthesized data is of real life, specifically in terms of realistic motions and temporal consistency; 2) the real-life experiments need a better quantitative presentation. Considering that the dataset is not publicly available, that benchmarks for comparison are lacking, and that no qualitative sequential data from the dataset is shown, both the similarity between the generated dataset and real data (preferably, the generated dataset would cover more diversity than the real data rather than being customized to that setup) and whether the fine-tuned models can remain robust on external test sets of similar manipulation tasks are questionable.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    Motivated by the lack of datasets for intraoperative egocentric hand and tool pose estimation, with application context in the field of Mixed Reality, the authors present such a large-scale dataset that covers a variety of surgical glove textures and different metallic tools. In addition, a novel, generalizable synthetic data generation pipeline capturing egocentric hand-object manipulation during surgical activities is presented. Compared with other state-of-the-art methods that only consider single-image datasets, the presented method synthesizes realistic temporal image sequences to ensure accurate and reliable 3D pose estimation. Experimental assessments of the presented synthetic dataset, in which existing state-of-the-art methods are used for model training and evaluation, demonstrate the benefits of the novel dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper addresses a relevant need for more datasets that are used for egocentric hand and tool pose estimation. Using temporal image sequences of surgical hand-tool interactions instead of single images within the data generation pipeline is an interesting idea and offers the potential to create more accurate 3D pose estimation.
    • The overall structure of the paper is well thought-through and the use of sub-sections in sections 2 and 3 increases readability.
    • The availability of the underlying source code and the dataset itself is a huge benefit in terms of reproducibility and further development of hand and tool pose estimation for different tasks, also non-surgical ones.
    • Mathematical equations are well described and easy to read, which increases reproducibility of the proposed methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The conclusion section is rather short and doesn’t contain a discussion. A discussion should address things that didn’t go well and/or things to improve in the future.
    • The fact that body motion is captured using four stereo cameras could be a constraint that limits reproducibility. The idea of first collecting body motion data and fusing it with hand posture data collected separately seems interesting, but it is a more complicated setup and involves additional hardware costs. This might just be a minor point, however, and potential benefits might outweigh these downsides.
    • Relevant implementation details such as the used OS, programming languages, used hardware (GPU, RAM etc.) are missing in the text, which reduces reproducibility.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • Availability of source code: The authors mention in the abstract that the code and the dataset will be publicly available. Therefore, before this paper gets accepted, the authors should provide links to the source code and the dataset. If this information is provided, the authors have done a great job to ensure reproducibility.

    • Implementation details: As already mentioned under the main weaknesses of this paper, some relevant implementation details are missing in the text, such as the used OS incl. version, used programming languages incl. version, and hardware (GPU, RAM, etc.). Only a few details are mentioned, for example that Blender is used incl. the bpycv package, but the Blender version is missing. The use of bpycv suggests that Python was used as the main programming language, but this should be mentioned in the paper.

    • Hardware setup: As already mentioned under the main weaknesses of this paper, the fact that body motion capturing is required for the proposed pipeline could decrease reproducibility because this requires additional hardware (four stereo cameras). However, this might just be a minor point.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    There are a few minor things that should be changed:

    1.) Abstract, last sentence: “The code and the dataset will be publicly available for research purposes”: This is probably obvious, but I’ll mention it anyway: Since you wrote that the code and the dataset will be available, you should change this sentence and provide a link to the source code and dataset (github project or similar).

    2.) Section 2, Fig. 1: In the caption text below this figure the “shows the…” text occurs four times and is therefore a bit repetitive: “(a) shows the…”, “(b) shows the …”, “(c) shows the…” and “(d) shows the…”. I’d suggest coming up with different expressions rather than writing “shows the” four times.

    3.) Section 3, page 6: You are referring to “two state-of-the-art hand pose estimation methods”, and “one hand-object pose estimation”, but just list the references [18, 22] and [19] respectively, without further description of these methods. I’d recommend describing these three methods briefly. This increases readability.

    4.) Section 3, page 7, Fig. 4: The images are very small which makes it rather difficult to see all details of the hand and tool pose estimation. I’d recommend increasing the images and using the full page width.

    5.) Section 3, page 7, the sentence that starts with “Particularly, we observe similar performance improvement for …”. The grammar seems to be slightly incorrect here. I’d rather write “Particularly, we observe a similar performance improvement for …”, or “Particularly, we observe similar performance improvements for …”

    6.) Section 3, page 8, Table 1: Some of the acronyms being used in the table header like “MPJPE” and “PA-MPJPE” are difficult to read. I’d recommend using easier acronyms.

    7.) Section 3, page 8, Fig. 5: The same as in Fig. 4: The images are too small and it is difficult to see the hand pose estimations. I’d recommend increasing the images a bit and using the full page width.

    8.) Section 4, page 8: In the following sentence “, with 88,329 RGB-D frames, diverse bloody surgical …” an “AND” is missing. I’d recommend changing it to “, with 88,329 RGB-D frames and diverse bloody surgical …”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is an excellent paper that addresses the very relevant need for surgical egocentric hand and tool pose estimation. The fact that a dataset and a data generation pipeline are provided offers other researchers the possibility to use this data for their own purposes and create their own synthetic data. However, the authors should provide links to the dataset and the source code before this paper gets accepted.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths: own dataset. Weaknesses: lack of a demo video; the dataset construction and MoCap annotation are unclear: is it hand-only or full-body, given that the SMPL model is a full-body model? Since a hand pose (i.e., a 3D hand skeleton) is the output, what does the skeletal model look like (e.g., how many joints)? It is also unclear whether the surgical tool pose is part of the desired output, and if so, how it is handled. How does the method differ from the hand pose estimation or hand-object interaction papers at CVPR/ICCV?




Author Feedback

The authors would like to thank the reviewers (R) for their constructive feedback and suggestions. All the suggestions/concerns are summarized:

Q1: Lack of demo video; unclear dataset construction and MoCap annotation. (MR) Reply: Thanks for the constructive comments. We will include a demo video in the GitHub repo, and the link to the repo will be added as a footnote in the camera-ready version. During body motion capture, we reconstruct the SMPL-X body sequences to model the hand pose evolution along with head and body movement, and replace the fine-grained finger movement with the generated MANO hand-tool manipulation sequences. We prepare our hand and tool pose annotations in the same format as HO3D, a popular hand-object interaction dataset, for better usability: the provided hand annotation consists of MANO hand parameters and 21 hand joints (the 16 MANO joints plus 5 fingertips), and the tool pose annotation consists of 21 control points on the 3D bounding box of the tool mesh. The training modules will also be included in the code release to promote future research on our dataset. Moreover, we will release the head-mounted camera trajectory and the utility functions to re-project the predicted and ground-truth hand and tool meshes (with PyRender) and joints and control points (with OpenCV) to the image plane, so that users can prepare the ground truth for their specific needs (see the re-projection sketch after this feedback).

Q2: How does the method differ from the hand pose estimation or hand-object interaction papers in CVPR/ICCV? (MR) Reply: We point out the limited generalizability of existing SOTA hand pose estimation and hand-object interaction methods to surgical scenarios, due to the large domain gap between everyday and surgical cases. To address the technical and ethical challenges of constructing surgical datasets, we propose our synthetic approach to generate a surgically realistic dataset. After fine-tuning the existing SOTA methods, which target everyday cases, on our synthetic surgical dataset, we show significant performance improvements on real-life surgical data.

Q3: Interpretation of the PCP curve and quantitative analysis of the real-life data. (R2 & R3) Reply: Thanks for the constructive feedback. PCP stands for Percentage of Correct Poses; we will make that clear in the camera-ready version. The curve represents the error distribution on real-life data and contains several key metrics, such as AP@30 and AP@50, which denote the average precision at error thresholds of 30/50 pixels (see the PCP sketch after this feedback). We will add a table in the GitHub repo to make those metrics and the mean error of each method more easily accessible.

Q4: Temporal realism of generated sequences. (R3) Reply: Thanks for the comments. We guarantee temporal realism by applying a smoothness constraint in body motion capture and by smoothing the head-mounted camera trajectory. Moreover, we will release a demo video containing reconstructed sequences as qualitative analysis.

Q5: Short description of the SOTA methods and GrabNet. (R1 & R2) Reply: Thanks for the suggestions. We will add a short description of the used methods in the method section for better readability in the camera-ready version.

Q6: Specification of the loss weights alpha, beta, gamma / OS and hardware setup / programming language / Blender rendering script. (R1 & R2) Reply: Thanks for the comments; we will include those details in the GitHub repo. In general, our proposed method is not computationally intensive and does not depend on specific hardware. The Blender rendering script will also be made publicly available for better reproducibility.

Q7: The four-stereo-camera setup for body motion capture limits reproducibility. (R2) Reply: Thanks for the observation. Body motion capture could also be performed with a single RGB camera using existing open-source packages such as MMhuman3d.
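
As a reference for the re-projection utilities mentioned in the Q1 reply, here is a minimal, hypothetical sketch of projecting 3D hand joints onto the image plane with OpenCV. The intrinsics and all variable names are placeholders of my own, not the authors' released code; the actual camera parameters would come with the dataset.

```python
import numpy as np
import cv2

# Hypothetical pinhole intrinsics (fx = fy = 600, principal point at 320/240).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

# 21 hand joints in camera coordinates (random placeholder values, in meters,
# shifted in front of the camera).
joints_3d = np.random.rand(21, 3) + np.array([0.0, 0.0, 0.5])

rvec = np.zeros(3)  # identity rotation: points are already in the camera frame
tvec = np.zeros(3)  # no extra translation
dist = np.zeros(5)  # assume an undistorted (rendered) image

joints_2d, _ = cv2.projectPoints(joints_3d, rvec, tvec, K, dist)
joints_2d = joints_2d.reshape(-1, 2)  # (21, 2) pixel coordinates
```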
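Likewise, for the PCP curve discussed in the Q3 reply, below is a minimal sketch of how a Percentage-of-Correct-Poses curve and the AP@30/AP@50 values could be computed. It assumes a pose counts as correct when its mean per-joint pixel error falls below the threshold; that reading, and the `pcp_curve` helper, are my own assumptions rather than the authors' exact definition.

```python
import numpy as np

def pcp_curve(pred, gt, thresholds):
    """PCP at each pixel-error threshold.

    pred, gt: (N, 21, 2) arrays of predicted / ground-truth 2D joints.
    Returns, per threshold, the fraction of frames whose mean
    per-joint pixel error is below that threshold.
    """
    per_frame_err = np.linalg.norm(pred - gt, axis=-1).mean(axis=-1)  # (N,)
    return np.array([(per_frame_err < t).mean() for t in thresholds])

# Usage with random placeholder data:
pred = np.random.rand(100, 21, 2) * 640
gt = pred + np.random.randn(100, 21, 2) * 10.0
thresholds = np.arange(0, 101)           # 0..100 px
curve = pcp_curve(pred, gt, thresholds)
ap30, ap50 = curve[30], curve[50]        # AP@30 / AP@50 read off the curve
```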


