
Authors

Mingxian Yang, Yinran Chen, Bei Li, Zhiyuan Liu, Song Zheng, Jianhui Chen, Xiongbiao Luo

Abstract

Flexible ureteroscopy (FURS) navigation remains challenging since ureteroscopic images are of poor quality, with artifacts such as water and floating matter, making it difficult to directly register these images to preoperative images. This paper presents a novel 2D-3D registration method with structure point similarity for robust vision-based flexible ureteroscopic navigation without using any external positional sensors. Specifically, this new method first uses vision transformers to extract structural regions of the internal surface of the kidneys in real FURS video images and then generates virtual depth maps by the ray-casting algorithm from preoperative computed tomography urogram (CTU) images. After that, a novel similarity function that uses no pixel intensity is defined as an intersection of point sets from the extracted structural regions and virtual depth maps for the video-CTU registration optimization. We evaluate our video-CTU registration method on in-house ureteroscopic data acquired from the operating room, with the experimental results showing that our method attains higher accuracy than current methods. In particular, it reduces the position and orientation errors from (11.28 mm, 10.8°) to (5.39 mm, 8.13°).
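
The paper's similarity function (Eq. 4) is not reproduced on this page; Reviewer #1 below characterizes it as essentially a Dice score between two segmentation masks, so the following is a minimal sketch under that reading. All names (`structure_point_similarity`, `depth_threshold`) are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' code: point-set overlap similarity
# between a ViT-extracted structure mask and a thresholded virtual depth map.
import numpy as np

def structure_point_similarity(real_mask: np.ndarray,
                               virtual_depth: np.ndarray,
                               depth_threshold: float) -> float:
    """Dice-like overlap of two pixel-coordinate point sets.

    real_mask: boolean structure mask extracted from a real FURS frame.
    virtual_depth: depth map ray-cast from CTU at a candidate camera pose.
    No pixel intensities are used anywhere in the score.
    """
    real_points = real_mask.astype(bool)
    virtual_points = virtual_depth < depth_threshold  # binarize the depth map
    intersection = np.logical_and(real_points, virtual_points).sum()
    return 2.0 * intersection / (real_points.sum() + virtual_points.sum() + 1e-8)
```

Registration would then search the six camera-pose parameters (position and orientation), re-rendering the virtual depth map at each candidate pose and keeping the pose that maximizes this score.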

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_12

SharedIt: https://rdcu.be/dnwOM

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This work presents a 2D/3D registration method that estimates the ureteroscopy camera poses relative to the CTU scans. Specifically, the method uses a vision transformer-based network to extract structural regions of the video images and generates virtual depth maps. The camera pose is estimated by optimizing a similarity function between the extracted region and depth maps.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper is overall well written. The introduction is well motivated. The method is clear and easy to follow.

    • The proposed method utilizes specific structures inside the kidneys to perform registration, which avoids the negative effect of the other anatomies. The idea of optimizing the intersection of point sets predicted from video and point sets obtained by thresholding virtual depth maps is interesting.

    • The evaluation is performed on clinical ureteroscopic data. The proposed method shows promising results compared to the related work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • This method aims to estimate the camera pose in 3D, but it extracts a binary segmentation mask from the video frame, which causes ambiguity when comparing it to the “projected” mask. Please refer to my detailed comments below.

    • This work compares with only one method from the literature (ref [7]) and overlooks the vast majority of learning-based endoscopic camera pose estimation work. Thus, the demonstrated advantage of this method is limited.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The manuscript does not mention whether the code/dataset will be published if the paper is accepted.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • Solving 3D pose estimation by comparing projection masks in 2D causes both translation and rotation ambiguities: in translation, the ambiguity lies along the depth direction; in rotation, it is the axial rotation. For example, if the structural mask is a rough circle at the image center, the 3D camera can rotate about the axis through the circle center and still generate the same 2D projection mask. The key issue is that information in the video images, such as textures and lighting, is lost in the segmentation mask. Making use of this information would further improve performance.

    • In the abstract, the authors state that “our method attains higher accuracy than current methods”. However, these current methods are not defined; it is better to name the comparison methods or the state-of-the-art methods in the literature. In the last sentence, “it can reduce the position and orientation errors from ** to **”, it is better to rephrase the result as an improvement over the comparison method (ref [7]). The current phrasing is confusing.

    • In the introduction section, it says “internal structures such as calyx, pelvis, …, are difficult to be observed in the CT images”. This is not true: the pelvis is clearly visible in CT images. This sentence needs to be rewritten.

    • The designed structural point similarity/cost function (equation 4) is essentially a DICE score between two segmentation masks. It is worth comparing it to other 2D similarity metrics, such as gradient normalized cross-correlation (Grad-NCC); see the sketch after these comments.

    • The ground-truth pose data are manually generated by three experts. How are the camera poses manually annotated? The reviewer finds it necessary to include more details of the ground-truth data generation/annotation.

    • In both the abstract and the conclusion, the error improvements are missing units (mm and degrees) for the comparison methods. The units should be added in both places.
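
To make the Grad-NCC suggestion above concrete, here is a minimal sketch of one common variant, namely normalized cross-correlation computed on Sobel gradient magnitudes rather than raw intensities; the function name and the choice of gradient operator are assumptions, not taken from the paper or ref [7].

```python
# Hypothetical sketch of one common Grad-NCC variant (not from the paper):
# normalized cross-correlation on Sobel gradient magnitudes.
import numpy as np
from scipy import ndimage

def grad_ncc(fixed: np.ndarray, moving: np.ndarray, eps: float = 1e-8) -> float:
    def grad_mag(img: np.ndarray) -> np.ndarray:
        img = img.astype(np.float64)
        gx = ndimage.sobel(img, axis=0)  # vertical gradient
        gy = ndimage.sobel(img, axis=1)  # horizontal gradient
        return np.hypot(gx, gy)

    f, m = grad_mag(fixed), grad_mag(moving)
    f = (f - f.mean()) / (f.std() + eps)  # zero-mean, unit-variance
    m = (m - m.mean()) / (m.std() + eps)
    return float((f * m).mean())  # in [-1, 1]; higher means better agreement
```

Because it operates on gradients, such a metric emphasizes structural edges over absolute brightness, which is one reason it is often preferred over plain NCC for endoscopic or X-ray style 2D/3D registration.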

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a complete and novel video to CT registration workflow. However, the experiments and evaluations can be more rigorous.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a vision-based 2D/3D registration method to estimate the camera pose for flexible ureteroscopic navigation. DPT-Base is employed to segment structures and stones inside the kidney, and virtual depth maps are generated from CT images using ray casting. The 2D/3D registration is conducted by maximizing the intersection region between the kidney structure mask and the thresholded virtual depth map. Experimental results show that the proposed registration framework outperforms one baseline method by lowering the average distance and orientation errors of the estimated camera poses.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper applied the transformer in a novel way for segmenting kidney structures and stones from video images. Although similar frameworks have been developed for other endoscopic navigation tasks, the proposed vision-based 2D/3D registration has some interesting modifications to deal with specific challenges in ureteroscopic navigation, such as segmenting out stones for better registration accuracy.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    According to the experimental results, the proposed method is not robust enough for continuous ureteroscopic tracking.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Good

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. In the last paragraph of the Introduction, “this work shows the first study to continuously track the flexible ureteroscope… using a vision-based method”. [3] and [4] are also vision-based methods for ureteroscopic navigation.
    2. Are CTU images contrast-enhanced CT images? Are kidney stones visible in CTU? Please explain the advantage of using excretory-phase data for generating virtual depth maps from CTU.
    3. The paragraph above Section 2.4 states, “To deal with these issues, we use the segmented stones as a mask to remove those regions with wrong depth.” Was a 3D segmentation method used to segment stones from CTU, or were they manually segmented?
    4. In Fig. 4, does the third row show the matching virtual depth maps generated by the proposed approach? It would be nice to include example video images with stones to show the performance of the proposed method on such corner cases.
    5. Are those short red lines in Fig. 5 outliers?
    6. How are the thresholds chosen to convert virtual depth maps to binary images? How sensitive is the registration accuracy to the threshold values?
    7. Some tracking failure examples are provided in Fig. 6. Apart from segmentation inaccuracy, will tissue deformation or the initial pose used in the 2D/3D registration cause any tracking failures?
    8. Please proofread the paper to fix any grammar issues.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel vision-based framework for ureteroscopic navigation, which has clinical value.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    A practicable video-CTU registration pipeline to compute the pose of the flexible ureteroscope relative to the CTU space.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The registration pipeline that utilizes virtual depth maps from CTU is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The dataset is too small. All training and test frames are from three clinical cases; thus, the reported segmentation accuracy is to be expected. When the method is applied to new clinical cases, however, its robustness will be greatly compromised.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is satisfactory, although the code is not openly accessible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    This paper describes a practicable video-CTU registration pipeline to compute the pose of the flexible ureteroscope relative to the CTU space. Experimental results demonstrate higher accuracy than a current work. The paper is well written and easy to follow. Major strengths: 1) Using a virtual depth map generated from CTU for registration is interesting. 2) The automatic and sensor-less tracking of a flexible ureteroscope is very challenging in clinical practice, and the proposed pipeline seems to work. Major weaknesses: 1) The dataset is too small; the training and test frames are from the same video, making the segmentation accuracy results of little value. 2) How the ground truth is obtained is not clear. 3) The real-time performance of the multi-stage registration pipeline is not clear; in contrast, it is easy for some end-to-end deep learning-based registration works to achieve real-time tracking. Suggestions: 1) Collect more clinical data for training, and divide the dataset based on the video index. 2) More related works should be considered for comparison, especially some end-to-end deep learning-based methods.

    In summary, I suggest a weak reject of this manuscript.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The dataset and the clinical value of this work.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I maintain my original decision




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a method using vision transformers to extract internal features of the kidneys in ureteroscopy videos and register them to virtual depth maps generated from CT images by the ray-casting algorithm. Some of the positives that the reviewers pointed out were the novel use of transformers for segmenting the internal structures of the kidney from the ureteroscopy images, the registration pipeline that aligns the segmentations with the virtual depth maps obtained from the CTU, and the clarity of the work. However, the critical comments of the reviewers include: 1) the use of the mask, which could result in position and rotational errors, 2) lack of comparison to other 2D similarity metrics, 3) lack of details on the generation of the ground truth, 4) explanation of the choice of thresholds for creation of the binary mask, 5) sensitivity analysis for evaluating the robustness of the registration while compensating for tissue deformation, 6) lack of sufficient heterogeneity of the dataset, 7) lack of comparison to other SOTA methods, and 8) lack of explanation of the real-time performance of the algorithm. In the rebuttal, I would like the authors to please address these points raised by the reviewers.




Author Feedback

We thank the reviewers for their time and effort, and appreciate their constructive and valuable comments to improve our paper. We will revise the paper according to the major concerns, which are addressed as follows.

Reviewer#1
Q1: Ambiguities; some information, such as textures and lights, is lost in the mask.
A1: Our method intentionally computes the similarity from the intersection of points (pixel coordinates) from the real image and the virtual map, which is designed precisely to avoid these ambiguities. We did not use any pixel intensity information to compute the similarity, since intensity usually introduces ambiguities in kidney images. This also explains why our method works much better than intensity-based DSSM [Luo et al.]. We also believe that combining intensity information such as textures and lights into our function to create a hybrid cost function can improve performance; we are working on that.
Q2: Comparison to other 2D similarity metrics.
A2: It is interesting to compare ours to Grad-NCC. In [Luo et al.], NMI, Local-NCC, NSSD, and MoMSE were compared to DSSM, and ours works better than multi-scale DSSM. We will integrate Grad-NCC or DSSM into our point-based similarity function.
Q3: Details of the GT generation.
A3: We generate GT with our in-house software, which allows the position and direction parameters of the virtual camera to be manually adjusted so that real endoscopic images are visually aligned to virtual images. This procedure was time-consuming and labor-intensive.

Reviewer#2
Q1: “…the first study to continuously track…using a vision-based” method.
A1: [3] establishes a correspondence only between “holes” in the CT reconstruction and two frames, which cannot generate a 3D pose per frame, while [4] only matches the target against the images in a virtual-image database that contains just 40 images. So [3,4] do not qualify as continuous tracking.
Q2: About CTU.
A2: CTU is not simply contrast-enhanced CT: CTU uses contrast dye to visualize the urinary system. Stones are invisible in CTU. Excretory-phase images clearly show the urinary system, so using excretory-phase data generates much higher-quality virtual images than plain CT data.
Q3: Stone segmentation.
A3: DPT-Base was used to automatically segment stones in ureteroscopic video images.
Q4: Fig. 4 & Fig. 5.
A4: Row 3 in Fig. 4 shows the virtual depth maps generated by our method; these maps are matched with the images in Row 4. The red lines in Fig. 5 are outliers.
Q5: Threshold selection and sensitivity.
A5: We set the threshold to [-1000, 120], where -1000 represents air and 120 was determined in accordance with physician experience and the characteristics of contrast agents. The upper threshold of 120 has a significant impact on registration accuracy: in our experiments, a change of 20 in the upper threshold remarkably decreases accuracy.
Q6: Tracking failure.
A6: Large renal tissue deformation and inaccurate pose initialization can result in tracking failure.
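
A minimal sketch of the HU windowing described in A5, assuming the window simply gates which CTU voxels the ray caster treats as renderable; the function name is hypothetical, and how the renderer consumes the mask (opaque surface versus transparent lumen) is not specified on this page.

```python
# Hypothetical sketch of the HU windowing from A5 (function name assumed):
# keep CTU voxels inside [-1000, 120] HU before ray casting.
import numpy as np

def hu_window_mask(volume_hu: np.ndarray,
                   lower: float = -1000.0,
                   upper: float = 120.0) -> np.ndarray:
    """Boolean mask of voxels inside the HU window [lower, upper].

    Per A5, -1000 HU corresponds to air and the upper bound of 120 HU
    was chosen from physician experience and contrast-agent behavior;
    A5 also reports that shifting the upper bound by about 20 HU
    noticeably degrades registration accuracy.
    """
    return (volume_hu >= lower) & (volume_hu <= upper)
```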

Reviewer#3
Q1: Segmentation dataset.
A1: For segmentation, the training data were 2 videos with 12101 and 9569 frames from two patients, while the testing data were 4 videos with 12101, 9569, 8976, and 25455 frames, respectively, from 4 patients. Training and testing data were not always from the same video.
Q2: GT generation.
A2: Please refer to A3 for Reviewer#1.
Q3: Real-time performance.
A3: The segmentation runs at 20 frames per second. With GPU acceleration, the overall pipeline runs at 15 frames per second, approximating real-time processing.
Q4: Method comparison with end-to-end deep learning-based methods.
A4: We are striving to compare against current endoscope navigation methods, which are not open-source; for example, we have already compared ours to [Luo et al. 2022]. It is interesting to compare our method with other deep learning-based approaches. In [Luo et al. 2022], multi-scale DSSM works better than a deep learning-based method [Shen et al. 2019], while our method attains higher accuracy than multi-scale DSSM. We will compare with more deep learning-based methods.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed the major comments of the reviewers, and I am happy to recommend acceptance of the paper. Further validation of their results is necessary, but in its current form the work is sufficient for MICCAI.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This work presents a 2D/3D registration method that estimates ureteroscopy camera poses relative to CTU scans. The overall writing and the main framework are good. However, I agree with R3 that the validation dataset is quite small and cannot fully support the claims in this manuscript. Therefore, my final rating is reject.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    From my reading of the reviews, the major concerns raised about this work were 1) possible limitations in the method design; 2) limited comparisons to other methods; and 3) insufficiently robust performance.

    The rebuttal provides answers to smaller specific questions but, in my assessment, fails to 1) convincingly answer these major shortcomings and 2) specify how they would be cleared up in a revised version of the manuscript.

    As a consequence, and given all the information available to me, I fear that unfortunately, the manuscript cannot be accepted at this time.


