
Authors

Paula López Diez, Kristine Sørensen, Josefine Vilsbøll Sundgaard, Khassan Diab, Jan Margeta, François Patou, Rasmus Paulsen

Abstract

Detection of abnormalities within the inner ear is a challenging task that, if automated, could provide support for the diagnosis and clinical management of various otological disorders. Inner ear malformations are rare and present great anatomical variation, which challenges the design of deep learning frameworks to automate their detection. We propose a framework for inner ear abnormality detection based on a deep reinforcement learning model for landmark detection trained on normative data only. We derive two abnormality measurements: the first is based on the variability of the predicted configuration of the landmarks in a subspace formed by the point distribution model of the normative landmarks, using Procrustes shape alignment and Principal Component Analysis projection. The second measurement is based on the distribution of the predicted Q-values of the model for the last ten states before the landmarks are located. We demonstrate outstanding performance for this implementation on both an artificial dataset (0.96 AUC) and a real clinical CT dataset of various malformations of the inner ear (0.87 AUC). Our approach could potentially be used to solve other complex anomaly detection problems.
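
The shape-based measurement described in the abstract can be illustrated with a minimal sketch, assuming standard Procrustes alignment (scipy) and PCA (scikit-learn). The function names (fit_normative_model, shape_abnormality_score), the number of components, and the aggregation of the pairwise distances d_ji into a single per-image score (here the mean pairwise distance) are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import PCA

def fit_normative_model(train_shapes, n_components=5):
    # train_shapes: (n_images, n_landmarks, 3) landmark sets from normative anatomies.
    mean_shape = train_shapes.mean(axis=0)
    # Align every normative configuration to the mean shape (Procrustes).
    aligned = np.stack([procrustes(mean_shape, s)[1] for s in train_shapes])
    # Point distribution model: PCA on the flattened, aligned configurations.
    pca = PCA(n_components=n_components).fit(aligned.reshape(len(aligned), -1))
    return mean_shape, pca

def shape_abnormality_score(pred_shapes, mean_shape, pca):
    # pred_shapes: (n_predictions, n_landmarks, 3) configurations predicted for one
    # image (e.g. from several agents/runs); needs at least two configurations.
    aligned = np.stack([procrustes(mean_shape, s)[1] for s in pred_shapes])
    # Project each aligned configuration onto the normative PCA subspace -> b_i.
    b = pca.transform(aligned.reshape(len(aligned), -1))
    # Pairwise distances d_ji = ||b_j - b_i||; aggregation into one score is
    # assumed here to be the mean over all pairs.
    d = np.linalg.norm(b[:, None, :] - b[None, :, :], axis=-1)
    return d[np.triu_indices(len(b), k=1)].mean()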

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_67

SharedIt: https://rdcu.be/cVRuS

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper proposes a method for inner ear abnormality detection based on a deep reinforcement learning model for landmark detection trained on normative data only.

    The method is based on the localization of multiple landmarks in CT scans using communicative and standard multi-agent reinforcement learning (C-MARL and MARL).

    After landmark localization, two different metrics for measuring abnormalities are defined: variability across agents within a PCA subspace, and the distribution of Q-values.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Clarity in the exposition and detailed explanation of the problem and state of the art.

    Great validation with manual annotations, including scans with congenital abnormalities (123) and normal anatomies (300).

    Results show good detection performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The methods section is not sufficiently clear and detailed:
    1. How do you define D_image? How is it related to d_ji?
    2. How do you “merge” the Q-values for each specific landmark? It seems you have a vector of Q-values per agent and a set of runs. How do you calculate the standard deviation if those are 10 runs of vectors? Please clarify.
    3. How does the aggregation of variances across agents and Q-values validate the hypothesis of a uniform distribution?
    4. Describe Fig. 4 in the caption.
    5. The definition of the weighting factor looks somewhat arbitrary. How were D_training and U_training obtained?
    6. The weighting factor defined with the median might dramatically change the variance of wU_image after weighting. Depending on their resulting variances, that may lead to an effective C_image equal to D_image or wU_image.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility of the paper is low. Datasets and code are not available. The description of the dataset heterogeneity was not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I think this is a very promising work that shows good performance in abnormal inner ear anatomy detection. The validation framework is excellent, and the proposal is sound. However, the good quality of this work is reduced by the lack of clarity in the methods part. Please consider clarifying the methods on the following points:
    1. Describe the distance between shapes properly. Is d_ji the L2 norm of the difference between shape vectors b_i and b_j?
    2. How do you “normalize” and “merge” the Q-values?
    3. How do you calculate the STD from vectors of Q-values?
    4. How does the aggregation of variances across agents and Q-values validate the hypothesis of a uniform distribution?
    5. Describe Fig. 4 in the caption.
    With these clarifications, I would strongly recommend publishing this interesting work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposal, approach and validation are good. The method’s clarity can be improved.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors describe a deep reinforcement learning framework for inner ear abnormality detection that leverages landmark detection. The abnormality detection is expressed through two proposed methods. One approach uses a projection of proposed landmark configurations into PCA space and analyses the variability using a norm (Procrustes distance) defined in that space. The other uses the distribution of Q-values of the last ten states before landmarks are located (i.e., a measurement of the uncertainty of the final landmark location). The entire approach is based on using normative data and does not require learning representations of the anomalies. They conduct tests on artificial and clinical data, and the results indicate good performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Their proposed method is trained only on normative data for landmark localization, making it easy to adapt to other anatomies and circumventing the need to represent the anomalies or, for that matter, to have balanced data of normal vs. abnormal cases. This is significant for the medical imaging case, where such data are difficult to obtain.

    Both abnormality measures are relatively straightforward to implement and easy to interpret. The first is a typical analysis in geometric morphometry-based anomaly detection. The second method uses the variability in Q-values for landmark detection, with the hypothesis that a normal configuration would result in a uniform distribution while abnormal configurations of landmarks would have a more varied distribution.

    The authors provide tests on both synthetic and real data, with compelling results for both despite a significant drop in performance on the real dataset.

    This seems to be the first work of its kind and presents a possible solution to a non-trivial problem of anomaly detection in the inner ear.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The strength indicated above of using normative data only could also be viewed as a weakness. Without knowing the representation of anomalies, the decision boundary for anomaly detection may be ambiguous (see Xuan et al., 2022, GAN-based anomaly detection, https://doi.org/10.1016/j.neucom.2021.12.093). For example, this approach would be limited in differentiating novel anatomies from real abnormalities. However, as a first attempt, this can be considered a step in the right direction.

    The core deep reinforcement learning approach presented is largely a straightforward application of Leroy et al. (2020), https://doi.org/10.1007/978-3-030-66843-3_18, and Vlontzos et al. (2019), https://doi.org/10.1007/978-3-030-32251-9_29. The innovation is therefore limited to the development of the abnormality measures.

    There is no effort to explain the difference in the two developed abnormality measures’ performance, particularly for the second dataset, given that the two approaches are quite different.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The models and algorithms are reasonably adequately described, although the reader is referred to the literature for the training procedure. The data is adequately described. No code is provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    For the first measure based on PCA, a key limitation of the covariance matrix in classical PCA is its high sensitivity to anomalies, meaning increased false positives. There are more robust PCA variants available, even within one of the papers the authors cite: Amor et al. (2017), https://doi.org/10.1109/DISTRA.2017.8167682. This may be worth mentioning in a discussion section.

    The two presented methods, PCA-based and Q-value-based, showed markedly different performance (Fig. 5e and 5f), for which the authors should try to provide an explanation.

    Some statements on limitations of the approach and future research direction would be welcome.

    The training regime is not presented, and instead the reader is referred to the literature. This could be briefly outlined so the manuscript is self-contained.

    There is no explanation for why 3 agents per landmark are used for the test on synthetic data, whereas 1 agent per landmark is used for the real-abnormality test. An explanation should be provided.

    It is not clear where the 92 normal-anatomy images mentioned on page 5 come from until one reads the results section on page 7, which is odd. The narrative order of mention should be revised.

    A description for the normalisation of Q-value vectors on page 6 would be helpful. How are the values normalised?

    The authors may want to explain their specific contributions, if any, to C-MARL and MARL.

    The authors may want to improve the readability by breaking up some of the long sentences.

    Some attention should be paid to grammar throughout the manuscript.

    A range for the voxel resolutions described on page 3 should be provided.

    There is a typo in the first sentence of the Figure 1 caption: “Set of landmarks used in small the Synthetic Set”.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is interesting work, albeit preliminary. This reviewer feels the authors should have focussed on one of the approaches and developed a more complete description and evaluation, rather than reporting on two different approaches incompletely. In addition, the narrative presentation needs some revision, as the long, dense sentences sometimes make it difficult to understand what the authors intend to convey.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Two major potential concerns this reviewer had, with regard to the sensitivity of PCA and an explanation for the difference in performance of the PCA and Q-value methods, have been addressed in the rebuttal. The remaining weaknesses are moderate.



Review #4

  • Please describe the contribution of the paper

    The authors presented a framework to detect inner ear abnormality using deep reinforcement learning-based landmark localization. They used a score that combined the PCA shape distance and the Q-value history distribution to detect the abnormality.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The idea of combining (or simply separately using) the PCA shape distance and the Q-value history distribution to detect the abnormality is novel.
    2. The way to generate a synthetic set is also interesting.
    3. The paper is well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Lack of comparisons to other methods. With a sufficient amount of data, I was curious how a classification network would perform on this task.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will not release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Add more comparisons to other methods. I was curious how a classification network would perform on this task.
    2. It would be interesting to further explore how to better combine different abnormality scores and make a better prediction, for example, the way the weighting factor is chosen, or other clues from the model output when an abnormality exists.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a clear clinical motivation and brings some new ideas to detect the inner ear abnormality.

  • Number of papers in your stack

    1

  • What is the ranking of this paper in your review stack?

    5

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper makes a good contribution in the application of reinforcement learning to inner ear abnormality detection. The clinical problem is well motivated, and the experimental validation, as well as the manual annotation, is performed well, with good detection performance. Some explanations are missing from the paper. The authors are advised to refer closely to the reviewer comments, especially regarding the training regime and the shape distance definition. Since the authors’ method also relies on PCA, it would be helpful to discuss whether the work by Amor et al. (2017), https://doi.org/10.1109/DISTRA.2017.8167682, can also be incorporated. This should at least be discussed.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the reviewers for the constructive and valuable feedback. Below, we try to address the issues while still adhering to the strict paper size limits.

Robust PCA: The paper by Amor et al. (2017) is a very relevant reference dealing with robust PCA, online PCA, and PCA-based anomaly detection. It is, however, important to note that the PCA in our work is computed using samples exclusively from the normative data (healthy anatomies) in a static dataset. Therefore, we consider all the data points used for the PCA analysis as non-corrupted. The use of a robust PCA is a nice addition, but not crucial; preliminary experiments did not show significant differences on our data. Secondly, the Jackson-Mudholkar approach to threshold estimation on the squared prediction error is very similar to our approach, and we did not find significant differences in our experiments. We can add a small paragraph about this in the discussion section.

Training: The training regime is end-to-end on a 12 GB GPU, using the ε-greedy search strategy with a forgetting factor, γ, set to 0.9 (empirically found). We use a multi-scale approach with three isotropic resolutions: 0.9, 0.6, and 0.3 mm. The average training time was 6 days. The memory limitation of the GPU is the main reason why only one agent per landmark is trained for the 12-landmark model, against three agents per landmark for the 5-landmark model. If needed, we can add two lines about this in the paper.

Methods: The normalization of the Q-values is done using the last 10 states of an agent: these values are divided by the largest Q-value of that agent in that run, and the standard deviation of the normalized values over all the runs is the uncertainty measurement of that landmark, u_n. To compute the uncertainty score of an image, U, the landmarks’ uncertainties are joined by computing the norm of the vector containing the u_n of all the landmarks in that image (see the illustrative sketch after this feedback). We check the uniformity of the Q-value distribution for each landmark independently, and not over all the landmarks in one image, because different landmarks have specific anatomical relevance. In future research, we aim to evaluate the possible explainability derived from knowing the uncertainty scores of each landmark, u_n, and how they correlate with the anatomical anomaly. The d_ji distance is the L2 norm of the difference between vectors b_j and b_i in the PCA space; we will modify the formula to make this clear. The weighting factor is computed over all the images of the training set, where both the D and U values follow a rather normal distribution. Even though those values are computed on the training set, the evaluation is not a deterministic process, due to the randomness of the initial position of the agents. Therefore, the weighting factor extracted from the training set proved to be representative and well suited to joining both measurements in the test-set evaluation, as can be observed in Figure 5.

Results: The difference in performance between the PCA and Q-value methods in Fig. 5e and 5f is indeed interesting. The reason is the difference between the models: C-MARL uses communicative layers that share explicit information between agents about their location and certainty. What the results show is that this information produces a lower estimate of the agents’ uncertainty based on the Q-value distribution, U, meaning this extra information makes the agents more confident about their estimation even when an abnormality is present.
When there is no explicit communication (MARL), the Q-values are based only on the appearance of the current state. Therefore, we concluded that the MARL model is better suited to our anomaly detection approach. We agree with the reviewers that a future comparison against a supervised method would be very interesting, but it was out of the scope of this paper.

Language: We will have the paper professionally proofread before final submission to eliminate overlooked grammatical errors.
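
For illustration only, a minimal sketch of the Q-value normalization, the per-image uncertainty score U, and the combined score described in the feedback above. The exact form of the median-based weighting factor is not stated, so w = median(D_train) / median(U_train) is an assumption, as are the function names; this is not the authors' code.

import numpy as np

def landmark_uncertainty(q_histories):
    # q_histories: (n_runs, 10) Q-values of one agent over the last ten states.
    # Each run is divided by its largest Q-value (assuming positive Q-values),
    # and the standard deviation of the normalized values pooled over all runs
    # gives the landmark uncertainty u_n.
    normed = q_histories / q_histories.max(axis=1, keepdims=True)
    return normed.std()

def image_uncertainty(per_landmark_q_histories):
    # One (n_runs, 10) array per landmark; U is the norm of the vector of u_n values.
    u = np.array([landmark_uncertainty(q) for q in per_landmark_q_histories])
    return np.linalg.norm(u)

def combined_score(D_image, U_image, D_train, U_train):
    # C = D + w * U, with a median-based weighting factor bringing both measurements
    # to a comparable scale (the exact definition of w is an assumption here).
    w = np.median(D_train) / np.median(U_train)
    return D_image + w * U_image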




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed the concerns raised by the reviewers as well as the meta-reviewers, especially as the relevance of the Amor et al. paper was brought up more than once. The application is novel and of great interest to the clinical imaging community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    8



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Strengths of the paper include an interesting approach to abnormality detection in the ear, which is well motivated, and a convincing validation. The primary concerns around the use of PCA and its sensitivity were addressed satisfactorily during the rebuttal, suggesting that the paper can be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Overall, the authors have carefully addressed the main concerns of the reviewers in the rebuttal. With all reviewers in agreement and offering positive scores, this paper shall be accepted for publication at MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5


