
Authors

Marc Demoustier, Yue Zhang, Venkatesh Narasimha Murthy, Florin C. Ghesu, Dorin Comaniciu

Abstract

Device tracking is an important prerequisite for guidance during endovascular procedures. Especially during cardiac interventions, detection and tracking of the guiding catheter tip in 2D fluoroscopic images is important for applications such as mapping vessels from angiography (high dose with contrast) to fluoroscopy (low dose without contrast). Tracking the catheter tip poses different challenges: the tip can be occluded by contrast during angiography or by interventional devices, and it is in continuous movement due to cardiac and respiratory motion. To overcome these challenges, we propose ConTrack, a transformer-based network that uses both spatial and temporal contextual information for accurate device detection and tracking in both X-ray fluoroscopy and angiography. The spatial information comes from the template frames and the segmentation module: the template frames define the surroundings of the device, whereas the segmentation module detects the entire device to bring more context for the tip prediction. Using multiple templates makes the model more robust to the change in appearance of the device when it is occluded by the contrast agent. The flow information computed on the segmented catheter mask between the current and the previous frame helps in further refining the prediction by compensating for the respiratory and cardiac motions. The experiments show that our method achieves 45% or higher accuracy in detection and tracking when compared to state-of-the-art tracking models.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43996-4_65

SharedIt: https://rdcu.be/dnwQd

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    The study introduces a Transformer-based model for real-time device tracking in medical applications. In particular, the authors describe the application of their model to tracking the location of the catheter tip, in the context of an interventional technique used in over 1.2 million cardiovascular interventional procedures. The authors therefore propose a clinical application that can aid physicians in complex interventional procedures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths: 1) The authors have proposed a transformer-based architecture for tracking catheter tips in real time, using catheterization imaging datasets with 44,957 annotated frames for training and validation, and tested their model on 17,988 frames. 2) The introduction and related work are clearly described, and relevant literature is discussed well throughout the text. 3) The algorithm and evaluation methodology are well explained. 4) The authors present a strong evaluation methodology, and the findings of the study support the objectives and conclusions of their work. 5) The method is benchmarked against other state-of-the-art methods and outperforms them significantly. 6) A thorough methodological analysis is also presented, supported by the ablation study in Table 2. 7) Overall, the findings support the potential for clinical translation of this study.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some weaknesses of the work are: 1) While the authors state that this study uses internal imaging datasets, it is not mentioned whether the study was approved by the Institutional Review Board. 2) Even though some complex cases, such as the presence of sternal wires, stent devices, etc., are taken into account, can the authors comment on the accuracy of their model in patients with complex PCI disease cases such as bifurcation, serial, and left main lesions in the vessels? 3) It is mentioned in the study that the work uses 2,314 sequences consisting of 198,993 frames, of which 44,957 are annotated, for training and validation, and 219 sequences consisting of 17,988 frames for testing. Could it be explained how the non-annotated images were used in the study? 4) Since datasets are typically limited and transformer-based models require large training datasets, it is recommended that the authors describe data titration experiments so that other field experts interested in this application can determine the minimum dataset size required for using such models.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Yes, for the algorithm description and relevant hyperparameters of the model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The authors are encouraged to address the following comments: 1) While the authors state that this study uses internal imaging datasets, it is not mentioned whether the study was approved by the Institutional Review Board. 2) Even though some complex cases, such as the presence of sternal wires, stent devices, etc., are taken into account, can the authors comment on the accuracy of their model in patients with complex PCI disease cases such as bifurcation, serial, and left main lesions in the vessels? 3) It is mentioned in the study that the work uses 2,314 sequences consisting of 198,993 frames, of which 44,957 are annotated, for training and validation, and 219 sequences consisting of 17,988 frames for testing. Could it be explained how the non-annotated images were used in the study? 4) Since datasets are typically limited and transformer-based models require large training datasets, it is recommended that the authors describe data titration experiments so that other field experts interested in this application can determine the minimum dataset size required for using such models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strength of the study outweighs the weaknesses which can potentially be addressed with minor revisions.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    The paper “ConTrack: Contextual Transformer for Device Tracking in X-ray” proposes a transformer-based model to locate the tip of a catheter in X-ray. Such a capability is useful to correlate fluoroscopic and angiographic images of the same scene, as well as to provide a basis for IVUS image reconstruction during a pullback operation. Technical contributions include using multiple templates for more robust matching, flow estimation to accommodate motion between image frames, and improved tip localization accuracy over existing methods (median ~2 mm down to ~1 mm or about 45%).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths of the paper:

    1. Catheter tip localization in X-ray is a clinically relevant problem with potential positive impact once solved
    2. Novel localization algorithm/model to address shortcomings of existing methods based on understanding of the underlying clinical scenario
    3. Clear accuracy improvement over existing methods
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Weaknesses of the paper

    1. The description of the methodology is somewhat unclear. As it is organized, the reader must read it in multiple passes to understand the full picture
    2. The collected data is not well described, e.g., patient demographics, collection protocol, procedure phase.
    3. While it is clear that the proposed method is superior to existing methods on the collected data, it is not clear if said data is representative of real procedures which involve device manipulation, or how the method is preferred over simpler methods e.g. using the built-in radiopaque markers that are commonly found on catheters.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The study uses an internal dataset so reproducibility is not applicable in this case; the description of the method is feasible, though the ability to implement the method independently is not known because we do not have the underlying hyperparameters.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    The proposed work shows promise by introducing a new method for a clinically relevant problem, based on shortcomings of existing solutions and an understanding of the clinical context (e.g., other devices interfering with tip tracking). Furthermore, the authors demonstrate promising performance when compared against existing methods on an internal dataset. There are just a few areas where tweaks to the paper and the experiments can greatly improve the quality of the paper. Some comments are in the form of questions raised by the paper, which, if addressed, would greatly strengthen the clinical motivations.

    Abstract

    • The abstract could be rephrased to clarify the clinical motivations and the corresponding contributions. This reader is familiar with clinical procedures such as PCI, EP, and interventional radiology yet had trouble parsing the wording of the abstract.
    • The relationship to IVUS is not quite clear to the present work, so this connection might need to be explained in the body of the paper and perhaps omitted from the abstract.

    Introduction

    • Similar to an above remark, it is difficult to follow how the present work differs from prior work when the present work has not yet been explained to the reader, so reordering details can boost clarity greatly
    • It is mentioned that Cycle Ynet suffers from drift over long sequences. The question arises in relation to the proposed method of what defines a “long” sequence in a clinical context?

    Methodology

    • In a real clinical case, how might tip tracking be initialized, activated, and reinitialized if tracking is lost? How might the system reset when the fluoro pedal is engaged after a long time?
    • How important is tip tracking during a contrast injection? The contrast medium persists for only a few seconds and clinicians normally pause operations during this period, particularly since contrast is manually injected. Would it be clinically beneficial for clinicians to proceed while an injection is taking place?
    • Clinicians normally use radiopaque markers on the devices for localization as they are very clear, so why is it preferred to overlook this feature?
    • How are template frames to be acquired/updated? If it is continuous, there could be a drift problem where mistracking in one frame leads to the error propagating to all subsequent frames until the tip is completely lost. How would template selection be performed to overcome this potential issue?
    • Similarly, how are templates verified to be high quality templates, particularly during a live procedure?
    • How many templates are needed to achieve sufficient performance? How much does performance improve with each additional template, and how different should templates be from each other to be useful?
    • We see from the experiments that tracking accuracy is not always perfect, so as helpful as it is to know where the tip is, it would also be useful to know when the tracking is errant so that the clinician or application knows not to trust the result. How might the approach provide guarantees of accuracy vs. less accuracy?
    • How is the ground truth manually annotated if obscured by contrast agent?
    • Motion flow in segmentation space is a neat idea - just need to ensure that the segmentation is good so as to mitigate drift

    Experiments and results

    • To evaluate the results more faithfully the reader would want to know more of the characteristics of the collected data, such as the approximate duration for each patient, which phase of the procedure was acquired and why said phase was selected, etc. Since the data involves real patients, at least some of the study approval and other parameters should be provided to the extent that anonymity remains preserved.
    • It is not apparent whether the catheter was manipulated during data collection. Tip tracking capability would be much more valuable while the device is maneuvered, otherwise clinicians can remember where the tip was once the contrast dissipates. Human vision is also well adapted to track breathing and cardiac motion.
    • How much accuracy might be necessary in different clinical contexts? Is what is achieved in the present work sufficient, or would greater improvements be needed?
    • Median errors are lower than mean errors, suggesting a skew towards larger errors. Were there any patterns observed in what causes greater errors? Do such high errors render the tracking unusable during those frames, and do templates have to be reset in this case?
    • There is not enough information to know how many breathing/cardiac cycles were captured by the dataset, so there could be potentially many redundant datapoints both in the training and testing sets. The comparison against other methods may remain conclusive but how the method extends to the broader procedure remains a question, particularly as the device is manipulated.
    • X-ray, particularly fluoroscopy, is not very high resolution, so reporting accuracy to one-hundredth of a millimeter is somewhat awkward. The spacing between pixels is already ~0.3 mm, so it is not likely to achieve 1/100 mm precision.
    • Table 2: The ablation study is quite interesting. It shows an interesting breakdown of the impact of various components. My only commentary is that many of the differences are within one pixel (<0.3 mm), and oftentimes it is random which pixel is selected as the tip position, so the differences may not be very significant.
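The two statistical remarks above (a mean above the median points to a tail of large errors, and differences below one pixel are hard to interpret) can be illustrated with a minimal sketch. The error values below are hypothetical and are not taken from the paper; the 0.3 mm pixel spacing is the approximate figure quoted in the comments above, and the helper names `summarize` and `round_to_pixel` are illustrative only.

```python
import statistics

# Hypothetical tip-localization errors in millimetres (not from the paper).
errors_mm = [0.6, 0.8, 0.9, 1.0, 1.1, 1.3, 6.5, 9.0]

# Assumed pixel spacing, using the approximate value quoted in the review.
PIXEL_SPACING_MM = 0.3


def summarize(errors, spacing_mm):
    """Return mean and median error in mm and in pixels, plus a skew flag."""
    mean_mm = statistics.mean(errors)
    median_mm = statistics.median(errors)
    return {
        "mean_mm": mean_mm,
        "median_mm": median_mm,
        "mean_px": mean_mm / spacing_mm,
        "median_px": median_mm / spacing_mm,
        # A mean above the median indicates a tail of large errors.
        "right_skewed": mean_mm > median_mm,
    }


def round_to_pixel(value_mm, spacing_mm):
    """Round a distance to the nearest whole pixel, expressed in mm."""
    return round(value_mm / spacing_mm) * spacing_mm


summary = summarize(errors_mm, PIXEL_SPACING_MM)
print(summary)
```

Reporting errors in pixels alongside millimetres, as `summarize` does, makes it easier to see when two ablation variants differ by less than the sampling grid can resolve.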

    Minor

    • p2: “a similar network add (sic) a graph convolutional neural network”
    • Sec2.2: “interferring”
    • p6: warpping
    • Fig 2b: One of the inputs is labeled “Decoder output” - is the input to (b) Body-Decoder also one of its own inputs?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper shows promise in improving catheter tip localization which is a clinically relevant problem, tempered by lack of information about the characteristics of the acquired data that would convince the reader that the superior performance would extend to live clinical scenarios (i.e., beyond a few particular seconds during a case).

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The objective is to track the catheter tip in percutaneous coronary intervention procedures. The approach retained is based on a Transformer.

    One application envisaged is the concept of a roadmap: fusion of the injected vessels over the fluoro. The compensation of the cardiac and respiratory motion comes from the detected catheter tip used as an anchor point.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the very detailed experimental work being done. Part of it is required for the training itself. The second part is for performance assessment.

    The approach includes several modules supported by a transformer feature. It is described at a high level and does not raise any special comment.

    • I congratulate the authors for the distinction between 3 categories: fluoro, angio and devices. An additional point would have been to mention the coverage of the test data set regarding the x-ray exposure situation, which is very dependent on patient thickness and gantry angulation. Large gantry angulation and large patient thickness require increased kV, which decreases the contrast in the image and also increases the noise level.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In the state-of-the-art discussion, the authors forget to mention iconic approaches with an explicit segmentation of the catheter. The catheter is a highly structured object with two parallel edges. One may think that doing an explicit segmentation is manageable.

    • Performance in time is not mentioned. It should also include the computational power required.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility can be questioned given the huge volume of annotated data to collect.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The volume of annotated data assembled for both the learning phase and the testing phase is large. For the testing phase, it relates to the level of confidence expected in the performance assessment. For training, about 200 000 frames are used, of which about 44 000 are annotated. As said by the authors, it is an effort to assemble and also to annotate. How long did the annotation task take? It would have been interesting to compare the volume of annotated data employed for training with that of the other approaches (CycleNet, etc). At this point, I consider it a strong limitation of the retained approach. It could be interesting to report performance as a function of the volume of data.
    • I have not seen any explanation of the use of the non-annotated data. Is it used for training the transformer?

    • Regarding the final application, image fusion between different successive angulations in the context of an interventional procedure, the need to provide an initial localization is a strong limitation. The conclusion and perspective discussion should include comments on how the authors intend to handle this point and whether they are going to extend their proposed tracking algorithm.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper represents a significant effort with good consideration of clinical needs. At this point, the main limitation is the initialization of the tracking.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a transformer-based tracker using both spatial and temporal contextual information for real-time device detection and tracking in x-ray fluoroscopy and angiography. The topic is clinically relevant and of interest to the community, the approach is novel, the experimental work is thorough, and validation experiments show superior performance in tracking compared to SOTA.

    Reviewer feedback requesting additional clarification of the methodology, more details of the dataset (including the annotations, and the coverage with regard to x-ray exposure), and additional discussion of results (including how the method is preferred over more traditional segmentation approaches, and performance in more complex clinical cases) should be incorporated in the final submission.




Author Feedback

We would like to express our sincere gratitude to the reviewers and the AC for their valuable feedback and constructive comments. All the reviewers have acknowledged the superior performance of our novel spatial-temporal tracking framework and the comprehensive evaluation we conducted. Due to space constraints, we have grouped the questions to provide a concise response, and we will provide clarifications in the final version of the paper.

Dataset and annotation. Our dataset consists of 2,533 X-ray sequences and 62,945 frames that were annotated by a dedicated annotation team and reviewed by experts. During occlusion (using contrast or other devices), annotators utilize the temporal aspect to ensure consistency. To reduce annotation effort, we train our approach using template and search images, which are subsets of the entire sequences. This allows us to annotate the training data sparsely while still capturing most of the variations. However, to ensure a comprehensive model evaluation, we annotate the entire sequences in the test set to avoid any compromises.

The complete dataset comprises video clips that encompass a wide range of scenarios, including different phases of PCI procedures, varying x-ray dosages (which impact image quality), catheters with various widths (ranging from 6 to 10 French), and catheters with or without radiopaque markers. For training and testing purposes, the dataset is divided at the patient level.
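A patient-level split such as the one described above can be sketched as follows. This is a minimal illustration and not our actual data pipeline: the function name `split_by_patient`, the `(patient_id, sequence_id)` record layout, and the test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_patient(sequences, test_fraction=0.1, seed=0):
    """Split (patient_id, sequence_id) pairs so no patient spans both sets."""
    by_patient = defaultdict(list)
    for patient_id, sequence_id in sequences:
        by_patient[patient_id].append(sequence_id)

    # Shuffle patients (not sequences) with a fixed seed for repeatability.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [(p, s) for p in patients if p not in test_patients
             for s in by_patient[p]]
    test = [(p, s) for p in patients if p in test_patients
            for s in by_patient[p]]
    return train, test


# Hypothetical toy data: 5 patients with 2 sequences each.
toy = [(f"patient_{i}", f"seq_{i}_{j}") for i in range(5) for j in range(2)]
train_set, test_set = split_by_patient(toy, test_fraction=0.2)
```

Shuffling patients rather than individual sequences is what keeps every sequence of a given patient on one side of the split.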

Methodology. 1) Regarding tracking initialization, it can be performed manually or through the use of an automatic detection model. In this paper, we focus solely on the tracking module and follow the established setups (e.g. [3], [5], [6], [14] in paper), leveraging manual annotation on the first frame for model evaluation. In general, long-term visual tracking with automatic (re-)initialization is a challenging problem and requires a system of approaches. A safe and automatic system for device and anatomy tracking is of great clinical relevance and will be important future work for us. 2) Regarding comparison with segmentation approaches, due to the complex scenes including catheter bending, device interference, and contrast injection, segmentation approaches without a temporal perspective suffer from robustness issues and thus produce inconsistent tip localization. However, they do provide valuable insights for inferring catheter tip motion. Therefore, we use the motion information as a surrogate signal in our refinement module to enhance the accuracy of the previously localized tip position.
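To make the idea of using catheter-body motion as a surrogate signal concrete, the sketch below shifts a tip estimate by the mean flow vector over the segmented catheter pixels. This is a deliberately simplified stand-in and not the refinement module of the paper, which is learned; the function names, the nested-list data layout, and the toy values are all assumptions made for illustration.

```python
def mean_flow_on_mask(flow, mask):
    """Average the (dx, dy) flow vectors over pixels where mask is truthy.

    flow: H x W nested list of (dx, dy) tuples.
    mask: H x W nested list of 0/1 values marking catheter pixels.
    """
    sum_dx = 0.0
    sum_dy = 0.0
    count = 0
    for flow_row, mask_row in zip(flow, mask):
        for (dx, dy), inside in zip(flow_row, mask_row):
            if inside:
                sum_dx += dx
                sum_dy += dy
                count += 1
    if count == 0:
        return (0.0, 0.0)  # no catheter pixels: apply no correction
    return (sum_dx / count, sum_dy / count)


def shift_tip(tip_xy, flow, mask):
    """Shift an (x, y) tip estimate by the mean catheter-body displacement."""
    dx, dy = mean_flow_on_mask(flow, mask)
    return (tip_xy[0] + dx, tip_xy[1] + dy)


# Hypothetical 3 x 3 example: two catheter pixels moving right and down.
zero = (0.0, 0.0)
toy_flow = [
    [zero, (2.0, 1.0), zero],
    [zero, (4.0, 3.0), zero],
    [zero, zero, zero],
]
toy_mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
shifted_tip = shift_tip((10.0, 20.0), toy_flow, toy_mask)
```

Restricting the average to mask pixels is the point of the example: background motion (ribs, diaphragm, other devices) does not contribute to the displacement estimate.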

Clarity. We conducted a study to determine the optimal number of templates and found that three templates yield the best performance. Due to the space limit, the findings are presented in the supplementary material. The flow estimation module relies on segmentation from neighboring frames. Since the training data is sparsely annotated, we utilize non-annotated data (i.e., the frame preceding the annotated frame) to generate the segmentation predicted by the segmentation module, as depicted in Figure 1 (c). In order to further enhance the clarity of the paper, we will incorporate the suggestions made by the reviewers regarding the phrasing of the abstract and introduction. Additionally, we plan to include a table detailing the hyperparameters in the supplementary material for the final version.


