Authors

Pragyan Shrestha, Chun Xie, Hidehiko Shishido, Yuichi Yoshii, Itaru Kitahara

Abstract

Intraoperative fluoroscopy is a frequently used modality in minimally invasive orthopedic surgeries. Aligning the intraoperatively acquired X-ray image with the preoperatively acquired 3D model of a computed tomography (CT) scan reduces the mental burden on surgeons induced by the overlapping anatomical structures in the acquired images. This paper proposes a fully automatic registration method that is robust to extreme viewpoints and does not require manual annotation of landmark points during training. It is based on a fully convolutional neural network (CNN) that regresses the scene coordinates for a given X-ray image. The scene coordinates are defined as the intersection of the back-projected rays from a pixel toward the 3D model. Training data for a patient-specific model were generated through a realistic simulation of a C-arm device using preoperative CT scans. In contrast, intraoperative registration was achieved by solving the perspective-n-point (PnP) problem with a random sample and consensus (RANSAC) algorithm. Experiments were conducted using a pelvic CT dataset that included several real fluoroscopic (X-ray) images with ground truth annotations. The proposed method achieved an average mean target registration error (mTRE) of 3.79 +/- 1.67 mm in the 50th percentile of the simulated test dataset and a projected mTRE of 9.65 +/- 4.07 mm in the 50th percentile of real fluoroscopic images for pelvis registration. The code is available at https://github.com/Pragyanstha/SCR-Registration.
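As a rough illustration of the scene-coordinate definition in the abstract, the sketch below back-projects a pixel into a ray and returns the first point where that ray meets a surface. It is a toy under stated assumptions, not the authors' simulation: a sphere stands in for the bone model, the X-ray source sits at the origin of a pinhole camera, everything is expressed in one coordinate frame, and the names `backproject_pixel`, `first_hit_on_sphere`, `focal`, `cx`, and `cy` are hypothetical.

```python
import math


def backproject_pixel(u, v, focal, cx, cy):
    """Unit direction of the ray from the source through pixel (u, v)."""
    # Assumed pinhole model: focal length in pixels, principal point (cx, cy).
    d = ((u - cx) / focal, (v - cy) / focal, 1.0)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


def first_hit_on_sphere(direction, center, radius):
    """First intersection of a ray from the origin with a sphere, or None.

    The sphere is only a stand-in for the 3D bone model. The returned point
    plays the role of the scene coordinate for that pixel.
    """
    # Solve |s * direction - center|^2 = radius^2 for the smallest s > 0.
    b = sum(d * c for d, c in zip(direction, center))
    c = sum(x * x for x in center) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None  # the ray misses the surface: no scene coordinate
    s = b - math.sqrt(disc)
    if s <= 0:
        return None  # surface is behind the source (or the source is inside it)
    return tuple(s * d for d in direction)


ray = backproject_pixel(256, 256, focal=1000.0, cx=256.0, cy=256.0)
hit = first_hit_on_sphere(ray, center=(0.0, 0.0, 100.0), radius=10.0)
```

Pixels whose rays miss the surface have no scene coordinate, which is the case the learned regressor has to recognize at test time.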

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43999-5_74

SharedIt: https://rdcu.be/dnwxr

Link to the code repository

https://github.com/Pragyanstha/SCR-Registration

Link to the dataset(s)

N/A

Reviews

Review #2

Please describe the contribution of the paper

This work presents a learning-based method that performs 2D/3D registration using a single X-ray image. A convolutional neural network (CNN) is trained on simulated X-ray images from preoperative CT scans, which regresses scene coordinates of a target X-ray image. The pose is estimated by solving a PnP problem from the predicted coordinates. The method is evaluated on real pelvic X-ray images.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper proposes a method that regresses the object’s scene coordinates from a single X-ray image, and applies the method to single-view 2D/3D registration. The method was evaluated on both synthetic images generated from the CT scans of six cadaveric specimens and real images.
- This work performs comparison studies against related work. The experimental results show that the proposed method is better than a direct pose regression method (PoseNet) and an anatomical landmark detection-based method (DFLNet), suggesting the proposed method’s advantage of global spatial structure reasoning and independence of landmark visibility. In particular, the proposed method performs well when the object is only partially visible in the X-ray image.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- This method essentially extends the idea of corresponding landmark coordinate detection to predicting 3D ray casting intersection coordinates of all corresponding object points in an image. Thus, the technical innovation of this method is limited.
- The generalization ability of the method is questionable. Regressing 3D coordinates from a single-view image is an ill-posed problem. The learning-based method is prone to overfitting to the training data. Both simulated and real images are from the same source of 6 cadaveric specimens, and the experiments were not k-fold cross-validated. The performance of the proposed method on unseen data is doubtful. Please see the detailed comments below.
- The writing quality of this paper is poor. There are many typos in the figures, text, and formulas. A few sentences have grammar issues and are confusing. Please refer to the detailed comments below. The reviewer finds it necessary to do a complete check and comprehensive revision of the writing to improve the manuscript’s quality.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

There are appended registration result figures in the manuscript. The manuscript does not mention the publication of the code if accepted.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
- Both simulated and real images are from the same source of 6 cadaveric specimens. The trained network model is very likely to be overfitted to the six CT scans. The training/validation/testing simulated X-ray data are all randomly assigned from the 6 CT scans. The test set CT data should be left out from the training and validation data. To evaluate the model’s performance accurately, it is better to perform k-fold cross-validation testing. The method needs to be further validated on more unseen data to demonstrate its generalization ability.
- The reviewer has noticed that the main method figure, Fig. 1, is presented differently in the main submission file and the supplementary file. The quality of Fig. 1 in the main submission file is very poor. The text, shapes, and images of the CNN regression model are blurry and hard to read. The segmentation model input CT has non-English characters. In Fig. 2, the 3D pelvis models and projection geometries are rendered at three different viewpoints. However, the 3D renderings do not present the differences between the ground truth pose and the predicted pose. It is unclear whether the rendered poses are good or bad.
- In equation (1), R is used to represent the X-ray transform function. This is confusing because R is used as the rotation matrix in the first paragraph of Section 2.1 Problem Formulation.
- Gross Failure Rate is not defined. It is unclear what metric is used to define a failure case. The mTRE calculated on projected image points (Table 2) is also not clearly defined. The reviewer finds it necessary to add clear quantitative definitions of the evaluation metrics, for example as formulas.
- The authors claim that DFLNet could not adapt to real X-ray images and left it out of the real testing results, but they did not explain why DFLNet does not adapt to real X-ray images.
- In Section 2.2 Registration, the sentence “First, the scene coordinate regression where a single view X-Ray image is input to a U-Net model to obtain scene coordinates” has grammar issues. The sentences in the same paragraph, “Third, the segmentation of CT-scan volume to obtain a 3D model of bone” and “fourth, …, to camera coordinates”, are incomplete. The reviewer suggests a thorough check of the writing and language of this paragraph. The current version is confusing and hard to read.
- In Section 2.2 Scene Coordinates, there is a typo of point x_ij. In Equation (4), the right bracket is missing.
- In the Limitations Section, the authors said “the optimization still requires a substantial amount of runtime for convergence” and “a substantial amount of offline time and resources”. It would be better to present a quantitative runtime and resource analysis.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The technical contribution of the proposed scene coordinates is limited. The evaluation of the proposed method is less rigorous. The writing quality can be improved. Given that this work went through a complete registration workflow on real cadaveric images, the reviewer recommends weak reject.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

In the context of X-ray to CT registration for intraoperative guidance, this paper implements a rigid 2D-3D registration that estimates, for each 2D pixel of the X-ray image, a 3D point of the CT, if it exists, that would project to the pixel location (denoted as “scene points” in the paper). Then, using the 2D/3D correspondences, the rigid pose of the CT can be recovered by solving the PnP problem with RANSAC (as the intrinsic parameters are known).

The scene point estimation relies on a U-Net model trained to infer the scene points from the 2D image. More precisely, the expected position of each scene point and its variance are inferred, so that excessive variance indicates a 2D pixel that does not actually match any 3D CT point.
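The pipeline summarized above can be sketched in two small steps: drop pixels whose predicted variance is too high, then score a candidate pose by counting correspondences whose reprojection error is small, which is the consensus test a RANSAC loop applies to each pose hypothesis. This is a minimal sketch under assumptions, not the authors' implementation: the variance is taken to be one scalar per pixel, the camera is a plain pinhole with known intrinsics, and the function names and the thresholds `max_variance` and `tol_px` are hypothetical.

```python
import math


def project(point, rotation, translation, focal, cx, cy):
    """Pinhole projection of a 3D model point under a candidate rigid pose."""
    # Camera coordinates: X_cam = R @ X + t, written out without NumPy.
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    return (focal * xc / zc + cx, focal * yc / zc + cy)


def confident_correspondences(pixels, scene_points, variances, max_variance):
    """Keep 2D/3D pairs whose predicted variance is at most max_variance."""
    return [
        (px, pt)
        for px, pt, var in zip(pixels, scene_points, variances)
        if var <= max_variance
    ]


def count_inliers(correspondences, rotation, translation, focal, cx, cy, tol_px):
    """Number of pairs reprojected within tol_px of their pixel."""
    inliers = 0
    for (u, v), point in correspondences:
        pu, pv = project(point, rotation, translation, focal, cx, cy)
        if math.hypot(pu - u, pv - v) <= tol_px:
            inliers += 1
    return inliers


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = confident_correspondences(
    pixels=[(0.0, 0.0), (100.0, 0.0), (5.0, 5.0)],
    scene_points=[(0.0, 0.0, 100.0), (10.0, 0.0, 100.0), (1.0, 1.0, 1.0)],
    variances=[0.1, 0.2, 9.0],
    max_variance=1.0,
)
good = count_inliers(pairs, identity, (0.0, 0.0, 0.0), 1000.0, 0.0, 0.0, tol_px=1.0)
```

A RANSAC-based PnP solver would repeat the scoring step for poses estimated from small random subsets of the correspondences and keep the pose with the most inliers.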
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper presents an interesting way of regressing 3D points from a 2D image to solve the 2D-3D registration problem. The cleverness lies in the design of the loss function, which avoids the need for a segmentation approach on the X-ray image.

The paper is well written and the experiments involve simulated and real data – providing convincing results with respect to published works.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

No major weakness can be really pointed at. Maybe the experiments could have benefited from the comparison with more accurate methods (e.g., intensity-based) but known to have a small capture range.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

While not all information is provided (e.g., code), authors did report in the manuscript what they stated in the reproducibility form.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

Limitations section:

When you wrote “the optimization still requires a substantial amount of runtime for convergence.”, how much are we talking about? Usually forward passes are reasonably fast (with respect to training), I wonder why it would take so much time. Maybe the authors can comment on that.

Specific comments:

Replace the incorrect acronym: Gross Failure Rate should be abbreviated as “GFR”, not “GRF”. Fig. 1: Japanese characters are visible in the figure.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

7
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper is well written, clear, presents original work on 2D-3D registration and compares to existing approaches.
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #1

Please describe the contribution of the paper

This paper presents a coordinate regression-based method for 2D/3D registration between an X-ray image and a CT volume. Given an input X-ray image, the method uses a CNN model to predict the 3D location of each pixel with respect to the 3D CT mesh model. PnP and RANSAC are then applied to obtain the rigid transform that aligns the X-ray image to the CT model. The experiments show that the proposed method outperforms one pose regression-based method and one landmark-based method.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Although regressing the world coordinates of a 3D model, or the uv coordinates of a mesh surface as in some other work, is not completely novel, it is interesting to see this applied to 2D/3D registration between X-ray and CT. It seems to largely improve registration performance compared with two previous works, especially on the “gross failure rate” metric.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The concept of regressing coordinates for pose estimation is not new and the authors just apply this to a new task.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Should be reproducible given that the method is relatively straightforward
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
1. The manuscript seems to be written in Word and there are some format issues. Authors may want to fix the format issues such as text going out of the border of the red box in Fig. 1.
2. In Fig. 1, it would be good to modify it to “CT data”
3. Above Eq. (2), typo in “x_ij”: the j is not in the subscript.
4. Above “Uncertainty Estimation”, the fact that the depth d is relative to the camera plane may need to be described.
5. Eq. (4), typo in the second term regarding a parenthesis.
6. Section 3.1: the three numbers mentioned do not add up to 8100.
7. Section “Real X-Ray Images”: what is the reason that DFLNet cannot adapt to real X-ray images? Did the authors follow the original paper in this respect?
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method is simple and yet improves the gross failure rate compared with the two previous methods dramatically.
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.
The authors propose a novel application of a CNN-based scene coordinate regression approach for 2D-3D registration between X-ray and CT images. Most of the reviewers are positive about this paper. After reading the comments raised by the reviewers, I recommend provisional acceptance of the work. The authors are recommended to address the reviewers’ constructive comments, summarized below, in the paper’s final version.
- The authors need to address formatting issues and typos throughout the paper, especially in figures, equations, and the narrative.
- There is a need for further validation on unseen data to demonstrate the model’s generalization ability.
- More clarification is needed on some aspects like why DFLNet cannot adapt to real X-ray images, the runtime of the optimization, and clearer definition of evaluation metrics.

Author Feedback

We thank all the reviewers and the area chair for their reviews and constructive comments. As many have pointed out, we will thoroughly examine the paragraphs for grammatical errors, typos, and unclear figures. We would like to address the four main comments provided by the reviewers below.

Need for further validation on unseen data to demonstrate the model’s generalization ability (R2, meta-reviewer’s second point): Regarding the comment by R2, we would like to clarify that the models are trained per CT scan (patient-specific model). On the other hand, we demonstrate the patient-specific model’s generalizability to real X-Ray images. In practice, a preoperative CT scan is generally available beforehand. This allows the generation of simulated X-Rays that can be used to train a patient-specific model. We will modify the manuscript to emphasize this and prevent potential misinterpretations in the experiments.

Why DFLNet cannot adapt to real X-ray images (R1, R2, meta-reviewer’s third point): Since our dataset consisted mostly of images with partially visible hips, only a few landmarks are visible per image. This causes DFLNet to overfit to the partially visible landmark distribution, while our proposed model mitigates this issue by learning the general structure (i.e., every surface point that is visible). We will extend the discussion section to include this reason and provide a qualitative analysis using the estimated heatmaps in the supplementary materials.

Comment on the runtime of the optimization (R2, R3, meta-reviewer’s third point): In our pose estimation module, RANSAC takes a few seconds per image to find a good pose since the scene coordinate regressor provides dense correspondences. This is in contrast to the landmark-based method, where the convergence speed is an order of magnitude lower due to the small number of correspondences. In the final manuscript, we will add a quantitative runtime analysis for the results in Tables 1 and 2.

Clearer definition of evaluation metrics (R2, meta-reviewer’s third point): The gross failure rate is defined as the ratio of failed registrations (i.e., those with an mTRE of 10 mm or more) to the total number of images. The projected mTRE metric used for evaluation on real X-Ray images in Table 2 is defined as the mean L2 error between the projected landmark points and the annotated landmarks. We will add a brief description with formulas for each metric in Section 3.3.
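The two metrics described above can be written down in a few lines. The sketch below is one reading of that description rather than the authors' evaluation code: a registration counts as a failure when its mTRE is not below the 10 mm threshold, and the projected mTRE is the mean Euclidean distance between corresponding 2D landmarks. The function names and the `threshold_mm` parameter are hypothetical.

```python
import math


def gross_failure_rate(mtre_values_mm, threshold_mm=10.0):
    """Fraction of registrations whose mTRE is not below the threshold."""
    if not mtre_values_mm:
        raise ValueError("need at least one registration result")
    failures = sum(1 for err in mtre_values_mm if err >= threshold_mm)
    return failures / len(mtre_values_mm)


def projected_mtre(projected_landmarks, annotated_landmarks):
    """Mean L2 distance between projected and annotated 2D landmarks.

    The result is in the same unit as the input coordinates.
    """
    if not projected_landmarks:
        raise ValueError("need at least one landmark")
    if len(projected_landmarks) != len(annotated_landmarks):
        raise ValueError("landmark lists must have the same length")
    distances = [
        math.dist(p, a) for p, a in zip(projected_landmarks, annotated_landmarks)
    ]
    return sum(distances) / len(distances)


gfr = gross_failure_rate([3.0, 12.0, 9.9, 10.0])                    # 2 of 4 fail
err = projected_mtre([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Whether an mTRE of exactly 10 mm counts as a failure is not stated in the rebuttal; the comparison `>=` follows from treating "smaller than 10 mm" as success.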

X-Ray to CT Rigid Registration Using Scene Coordinate Regression