
Authors

Jinhee Kim, Taesung Kim, Taewoo Kim, Jaegul Choo, Dong-Wook Kim, Byungduk Ahn, In-Seok Song, Yoon-Ji Kim

Abstract

Diagnosis based on medical images, such as X-ray images, often involves manual annotation of anatomical keypoints. However, this process involves significant human efforts and can thus be a bottleneck in the diagnostic process. To fully automate this procedure, deep-learning-based methods have been widely proposed and have achieved high performance in detecting keypoints in medical images. However, these methods still have clinical limitations: accuracy cannot be guaranteed for all cases, and it is necessary for doctors to double-check all predictions of models. In response, we propose a novel deep neural network that, given an X-ray image, automatically detects and refines the anatomical keypoints through a user-interactive system in which doctors can fix mispredicted keypoints with fewer clicks than needed during manual revision. Using our own collected data and the publicly available AASCE dataset, we demonstrate the effectiveness of the proposed method in reducing the annotation costs via extensive quantitative and qualitative results.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_65

SharedIt: https://rdcu.be/cVRuQ

Link to the code repository

https://github.com/seharanul17/interactive_keypoint_estimation

Link to the dataset(s)

AASCE dataset: http://spineweb.digitalimaginggroup.ca/Index.php?n=Main.Datasets


Reviews

Review #1

  • Please describe the contribution of the paper

    An interactive X-ray image keypoint estimation method is presented in this submission. The proposed approach aims to reduce the manual correction cost: instead of fixing each wrongly predicted keypoint individually, a user only needs to correct one point, and all other keypoints are updated as the user’s modification is propagated to them. The proposed approach is evaluated on multiple datasets (AASCE and others).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This submission tackles the problem of correcting automatic keypoint estimates in X-ray images, where manual correction of multiple keypoints can be time-consuming and inefficient. The brute-force approach is to correct each keypoint independently; the proposed method instead reduces user interaction by introducing an interaction-guided gating network that propagates the user input across the image.

    • A morphology-aware loss is proposed based on the observation that the degree of freedom between the keypoints is small in X-ray vertebra images, which regularizes the network to learn the inter-keypoint relationship to be similar to that of the ground truth.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Part of the explanation and discussion in the experiments section is confusing. The definition of manual revision in Fig. 4 is not clear; the user interaction is also a type of manual revision.

    • The performance of the proposed approach is close to that of the general method RITM [16]. As shown in Fig. 4, when plotting the MRE against an increasing number of user interactions, the curve of RITM is similar to that of the proposed method; the same holds for the right plot in Fig. 4. Considering that the proposed method is tailored to X-ray vertebra images, a large improvement would be expected but is not observed.

    • It is shown that the proposed method can correct most of the wrong keypoints when a user corrects only one point. However, it is unclear which keypoint the user should select and correct to achieve such efficient correction.

    • The limitations of the proposed approach and the directions for future research are not discussed in this submission.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors include code with detailed readme as part of their supplementary materials. Demo video is provided as well to demonstrate the user interaction process.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This submission tackles the problem of fixing the errors in automatic keypoint estimation from X-ray images for vertebrae applications, which can be improved from the following perspectives:

    • Revise the explanation and discussion in the experiment section, especially the part for Fig. 4.
    • Discuss why the proposed method is close to RITM in MRE for increasing the number of user interactions.
    • Discuss how to select the one keypoint to correct so that the information can be used to correct all other keypoints; does it matter which keypoint is selected?
    • Discuss the limitations and directions for future research.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The current rating is based on the following major factors:

    • This submission tackles an important problem in fixing the errors in automatic keypoint estimation for medical images.
    • The proposed method has the potential to improve the user correction efficiency.
    • The discussion and explanations in the experiments are not clear enough.
    • The discussion of the limitations/future work is not provided.
  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents an X-ray landmark detection algorithm with the possibility of interactive corrections made by the end user. Taking into account the morphological information of the anatomy, the revised landmark detection adheres to the structural constraints of the desired object (e.g., cervical spine). Although interactive segmentation networks have been previously proposed, to the best of my knowledge this algorithm is one of the first efforts to use artificial intelligence for interactive landmark detection in X-ray imagery. Although not directly translatable, a comparison to state-of-the-art interactive segmentation methods is also provided.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The underlying assumption about keeping the clinician in the loop of landmark identification is very reasonable given that the presented network can fully benefit from the contextual knowledge of the clinicians in revising its predicted landmarks. The implemented network architecture (despite being already published as an RITM network) along with the morphology-aware loss seem appropriate choices for the application at hand. The provided evaluation study can support the paper’s conclusions by providing a comparison to the state-of-the-art interactive segmentation methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    In general, the amount of detail presented for this algorithm is not consistent throughout the paper. While some components of the network are explained in high detail, crucial information about the network architecture and the training process is completely missing. For instance, the previous prediction of the network appears to be a separate input channel, yet it is not clear how this is embedded inside the training workflow. Furthermore, aspects related to the gating network and the morphology-aware loss are not well communicated; therefore, one may find it difficult to fully understand the underlying methodology behind those network components. Additionally, the writing tone and clarity can be improved to help with the general comprehension of the paper. For instance, the authors’ statement about the NoC_5@3 and FR_5@3 metrics is extremely hard to grasp due to subpar writing quality.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As supplementary information, the authors have provided access to the code and an illustrative video showcasing the performance of the algorithm on their test dataset. Within the paper itself, however, no reference to the implementation details, including the utilized libraries, programming language, and network hyperparameters, is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    In general, the writing tone and style can be revised to improve the understandability of the implemented methods. Even though the backbone network architecture is borrowed from an existing method, the authors should better explain the general information flow inside the network and the associated training process. Aspects related to the gating network and the morphology-aware loss were hard to follow given the lack of proper explanations regarding the underlying requirements of those components and their added benefit. While I appreciate the effort of comparing the developed landmark detection algorithm to existing interactive segmentation networks, and despite the fact that the authors performed this comparison on the basis of heatmaps (for consistency), one may argue that this is not an adequate evaluation given that those methods were inherently developed for completely different applications.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed algorithm is novel and the targeted use case is of high clinical value. Significant details regarding the network architecture and the training process are currently missing in the paper. Aspects related to the gating network and the morphology-aware loss are not adequately explained.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present an approach for interactive landmark estimation refinement. They utilize iterative user input, similar to attention weights, as a gating mechanism for the main network and additionally regularize the network’s prediction using a “morphology-based loss” derived from dataset statistics. The authors evaluate their approach on one public and one private dataset with synthetic user interaction and demonstrate improvements compared to three (+1) frameworks originally developed for interactive segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors present an interesting interactive strategy for landmark detection with convincing improvements compared to other approaches.
    • The authors evaluate their strategy against three other approaches.
    • The method is comparatively straightforward and modular, with an interesting combination of attention-based interaction and additional regularization based on expected morphology.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • There are some issues in the mathematical & formal description of the method, which I would encourage the authors to revise (see below).
    • The concept of “low-variance”/”high-variance” landmarks is only described superficially (e.g., was there a specific threshold selected? how?).
    • The authors do not evaluate nor discuss “noisy” interactive inputs (or repetitive corrections of the same point) by real users, but only work with “simulated” interactions (based on the ground truth).
    • Only mean results are reported, without variance and / or effect of repeated training. Failure cases are not analysed further.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is partially reproducible. The authors aim to provide the code (training and evaluation) as well as the pretrained models, which means that it should be possible to reproduce the results on the smaller, publicly available data set. The in-house data set will not be made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    General comments:

    Introduction:

    • One clear example of where key points are used clinically (ideally with reference to the tasks evaluated) would be appreciated - the motivation is currently very general and superficial
    • Since the authors work on X-ray projections, I would encourage them to include further citations for medical images/X-rays: Bier et al., MICCAI 2018, https://doi.org/10.1007/978-3-030-00937-3_7; Kordon et al., MICCAI 2019, https://doi.org/10.1007/978-3-030-32226-7_69
    • p. 2: “Also, RITM … “ - The authors may want to summarize the main idea behind RITM shortly (and introduce the abbreviation), furthermore, the terms “HRNet-W32” or “hint fusion layer” mentioned later in the text may not be clear to a majority of readers and could be shortly explained.

    Method:

    • p. 4: The formalization of Eq. (1) doesn’t seem to fully hit the nail on the head, and should be revised. The line above the equation kind of predefines n as a point with a user interaction. In the equation itself, there is then an additional condition, i.e., n \in {l_1, l_2, …}. From my perspective, the subset of “interaction landmarks” should be more clearly defined (as it is defined now, it may refer to any landmarks…). Potentially, the authors want to define a subset of adapted landmarks L_adapted \subset L_all.
    • In the same equation, I would expect to see the position described by the “user interaction” itself, not the ground truth position. This can be synthesized by employing the ground truth…
    • Using a cross-entropy loss for heatmap regression (alone) strikes me as relatively unusual - did the authors also experiment with L2-loss/L1-loss?
    • “We apply the global pooling method on the feature maps to aggregate the most activated signal per channel; the resulting vector is in Rdw. Specifically, we adopt the global max pooling layer, which selectively retrieves the important interaction-aware features for each channel. “ - this is phrased rather complicated, wouldn’t it suffice to formulate something along the lines of: “We use channel-wise global max-pooling to obtain channel-wise activations. These are further processed by two fully connected layers (I presume - from the equation) and a sigmoid activation function to form per-channel gating weights for the main network.” (Eq. 2 doesn’t really add more clarity to the paper)
    • p.5: The authors state that a subset of landmarks was selected for the morphology-based loss. How were these landmarks selected? Was there some threshold of variance? This should be explained at least briefly and is currently rather vague.
    • It may not be clear to every reader that the soft-argmax function is an essential ingredient to get the morphology-based loss working. I would encourage the authors to make it more clear how the heatmaps can be used to derive a morphology-based loss.
    • p. 6: The authors should clarify whether/that a patient-wise split was applied to the data
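    The per-channel gating that the fourth bullet above paraphrases (channel-wise global max pooling, two fully connected layers, sigmoid, then per-channel weighting of the feature maps) can be sketched as follows. This is only an illustration of the reviewer's suggested formulation, not the authors' implementation; the weight shapes and hidden size are arbitrary.

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def channel_gating(features, w1, b1, w2, b2):
        """Illustrative per-channel gating:
        channel-wise global max pooling -> two fully connected layers
        -> sigmoid -> per-channel weights applied to the features.

        features: (C, H, W) interaction-aware feature maps
        w1: (D, C), w2: (C, D) weight matrices (sizes are illustrative)
        """
        # channel-wise global max pooling: one scalar per channel
        pooled = features.max(axis=(1, 2))           # (C,)
        hidden = np.maximum(0.0, w1 @ pooled + b1)   # (D,), ReLU
        gate = sigmoid(w2 @ hidden + b2)             # (C,), each in (0, 1)
        # broadcast the per-channel gate over the spatial dimensions
        return features * gate[:, None, None]
    ```

    Since every gate value lies in (0, 1), the module can only attenuate channels, which matches the "gating" reading of the reviewer's paraphrase.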

    Evaluation & Discussion:

    • There is no guarantee that an annotated point will actually end up where the user wants it to be, correct? Were there any border cases observed? This should be shortly discussed.
    • “when comparing the performance of Vertebra-focussed and each model …” - not clear, there also seem to be some repetitions in the results section. I would encourage the authors to double-check this section again.
    • I don’t quite understand Fig. 4: What is the difference between the methods with and without manual revision? Does this just mean that one landmark is moved? If this is the case: Couldn’t this also be reported for the “Vertebra-focussed model”?
    • Fig. 5 is not fully clear - what are the values mentioned below the images (Initial, After, delta)? Why is there gain for one image but “delta” for all others?
    • The authors only report the mean error, here, a more detailed analysis of at least variance across images / repeated training etc. would be expected. Additionally, an analysis of failure cases would be desirable.
    • The ablation study is not very clearly described. E.g., how are low/high variance differentiated? Are all adjacent points included? Also, the discussion of these results is a bit superficial. I am missing an ablation that only uses the morphology aware loss (without the interaction guidance).

    Minor comments & typos:

    • Abstract: additions such as “as shown in Fig. 1” should not be contained in the abstract.
    • p.2: “motivated by SE block” - abbreviation SE should be introduced.
    • p.2: “vertex points on the cervical vertebra have limited deformation” - this is rather unclear - what do the authors mean here?
    • p.2: “… user modifications than manual revision” - than > compared to
    • p.2: “Adding the proposed gating network on the model …” - unclear - how is a network added “on” the model - to? combined with? What is this “model”? (see also on p. 3)
    • p.4: “are filled with zero matrices” - why not simply “are filled with zeros” or “contain only zeros”
    • “It allows all pixel positions in the feature maps of the main network to attend to each significant pixel position with respect to the user interaction information.” - I don’t quite understand what the authors want to express here.
    • p.5: “the ones that rarely deviates …” - typo
    • p.5: “as the criterion to apply the proposed loss” - What is meant here?
    • p.6: “… to achieve the MRE under 3” - measurement unit of 3 should be mentioned
    • p.6: Why were the thresholds selected like they were? i.e., 3/5/10?
    • p.6: “the images with high error than … “ - grammar
    • Fig. 4: “reivision”
    • References: Capitalization seems a bit off.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach of combining interaction with a morphology-based loss is very interesting and seems to guide the network in multiple aspects. I haven’t yet come across such an approach for landmark detection (but I am also not fully familiar with the corresponding literature in the CV domain). It is generally well evaluated with multiple reference methods as baselines. Some of the weaknesses (description of the method, missing variance in the results, discussion) can be rectified rather easily, one aspect that I would have liked to see in addition is a user study that confirms the improvements in addition to “simulated interactions”.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The authors present a novel method for interactive keypoint detection. The inputs of the network include the original image, the previously predicted keypoints, and user interaction information. A gating module is used to selectively fuse image feature maps and user interaction clues. In addition, a high-order keypoint relation loss is added. The proposed method is evaluated on two datasets. Ablation studies are provided as well.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of the paper:

    1. A novel interactive keypoint refinement method is proposed for keypoint detection tasks in the medical image domain.

    2. An interaction-guided gating module is introduced to propagate the interaction clues in the network.

    3. An additional loss for keypoint estimation is proposed to incorporate high-order information between local keypoints.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The presentation of the method is great. The authors also provide ablation studies of the effectiveness of the presented component.

    In Table 2, it would be great to include results with morphology-aware loss, but without the gating module. It then can show the significance of these two main contributions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide the source code in the supplementary files.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The overall presentation of the paper is great. I have a few suggestions for the paper.

    1. In Table 2, it would be great to include results with morphology-aware loss, but without the gating module.

    2. The detail of the hint fusion layer is missing in the paper.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although interactive segmentation methods have been studied, there is no such study in keypoint estimation. In this paper, the authors present a novel method to explore the interactive keypoint estimation in medical images. In addition, the authors provide a new loss for keypoint estimation to encode the morphology or relationship information of keypoints.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper received four highly detailed reviews, all of which agree that the method is interesting, the evaluation using three benchmark methods is convincing, and that there is sufficient novelty (despite RITM having previously been proposed). While the reviews are generally enthusiastic, the reviewers all echoed concerns around the clarity of the paper (including possible inaccuracies in the mathematical description) and insufficient discussion. As appropriate, these concerns should be taken into account in preparation of the final manuscript.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1/17 (>10th percentile)




Author Feedback

We appreciate the AC and all reviewers for their valuable feedback and positive support. R1, R2, R3, and R4 denote responses corresponding to each reviewer. All notation errors, typos, and missing details will be thoroughly revised in the camera-ready to improve the clarity of the paper, e.g., additional explanations on [R1, R2] manual revision in Fig. 4, [R2] low/high variance & adjacent points in Table 2, and [R3] the previous prediction of the model.

[R1, R2, R4] Details on the RITM baseline model [R2] As stated in Section 1, Reviving Iterative Training with Mask Guidance (RITM) reactivated iterative training for multiple user revisions to make a model aware of the prediction mask created in its previous step. [R2, R4] It uses a High-Resolution Network as a pre-trained backbone and employs a simple convolutional block to feed additional inputs, e.g., the user clicks, without any architectural changes to the backbone. Following RITM, we also used an additional hint fusion layer, which takes an input image, user clicks, and the previous prediction of the model as inputs and outputs a tensor with the same shape as the output of the first block of the backbone. [R1] We also proposed the morphology-aware loss (morph loss) and the interaction-guided gating network, which showed superior interactive keypoint estimation performance compared to RITM. For example, the failure rate of our method was less than half that of RITM in Table 1. Also, when increasing the number of user interactions in Fig. 4, our method consistently outperformed RITM by a large margin, decreasing the error by five pixels on average.
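The hint-fusion idea described above (fuse image, user clicks, and the previous prediction into a tensor matching the backbone's first-block output) can be sketched in its simplest form as channel-wise concatenation followed by a 1x1 convolution. This is a reader's sketch, not the authors' or RITM's actual layer, which uses small convolutional blocks; all shapes below are illustrative.

```python
import numpy as np

def hint_fusion(image, clicks, prev_pred, w, b):
    """Minimal hint-fusion-style sketch (illustrative only):
    concatenate image, user-click maps, and the previous prediction
    along the channel axis, then apply a 1x1 convolution so the
    output channel count matches the backbone's first block.

    image: (C_img, H, W); clicks, prev_pred: (K, H, W) per-keypoint maps
    w: (C_out, C_img + 2*K) 1x1-conv weights; b: (C_out,)
    """
    x = np.concatenate([image, clicks, prev_pred], axis=0)   # (C_in, H, W)
    # a 1x1 convolution is a per-pixel linear map over channels
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]  # (C_out, H, W)
```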

[R1, R2] Limitation and future work In this work, we assumed that a user always provides a correct modification to the model and revises a clearly wrong keypoint first, as stated in Section 3. Thus, the keypoint with the highest error in each prediction was selected for revision in all experiments. Future work can address noisy interactive inputs by real users, as R2 stated, or find the most effective keypoint for correcting all other keypoints and recommend that a user revise it first.

[R2, R4] Additional ablation study We ablated the proposed gating network on Cephalometric X-ray, and the results in the order of the metrics (FR5@3, NoC5@3, NoC5@4, NoC5@5) were: gating & morph loss, (4.48, 2.32, 0.86, 0.31); only morph loss, (6.33, 2.56, 1.03, 0.38). These show that the gating network significantly contributes to performance improvement. We will add these results in the camera-ready.

[R2] Additional Response

  • We used binary cross-entropy (BCE) loss since it showed better performance than L1 and L2 losses. The results on Cephalometric X-ray in the order of the metrics (FR5@3, NoC5@3, NoC5@4, NoC5@5) were: BCE loss, (4.5, 2.3, 0.9, 0.3); L2 loss, (11.7, 3.1, 1.7, 1.0); L1 loss, failed to converge.
  • The proposed morph loss is applied to 2D keypoint coordinates. Thus, rather than employing the argmax function, we used a differentiable soft-argmax function to extract 2D coordinates from the predicted keypoint heatmaps.
  • We applied a patient-wise split to the data.
  • We post-processed the model predictions so that a user-revised point is not updated by a model, staying where the user wants it. Thus, re-correcting the same point is also unnecessary in our setting.
  • We set the range of the target mean radial error values as [0, 10] and [0, 60] for Cephalometric X-ray and AASCE, respectively. We selectively reported the results for some of them in Fig. 4. In our supplementary material, Fig. 6 shows the results on different target values.
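The soft-argmax step mentioned in the second bullet above (a differentiable replacement for argmax that extracts 2D coordinates from a heatmap) can be sketched as the expected coordinate under a softmax over the heatmap. The temperature `beta` is an assumed illustrative parameter, not a value from the paper.

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=10.0):
    """Differentiable soft-argmax: the expected (row, col) coordinate
    under a softmax over the heatmap values. Larger beta sharpens the
    distribution toward the hard argmax; beta here is illustrative.
    """
    h, w = heatmap.shape
    logits = beta * heatmap.ravel()
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return float((p * rows.ravel()).sum()), float((p * cols.ravel()).sum())
```

Because the output is a smooth function of the heatmap values, a coordinate-space loss such as the morph loss can backpropagate through it, which a hard argmax would not allow.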

[R3] Adequateness of interactive segmentation models as baselines We assessed the proposed interactive keypoint estimation model by comparing it with interactive segmentation approaches, which share similar concepts in the interactive system. Along with the ablation study, it is an essential evaluation of our method, given the lack of research on interactive keypoint estimation.


