Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Syed Zulqarnain Gilani, Naeha Sharif, David Suter, John T. Schousboe, Siobhan Reid, William D. Leslie, Joshua R. Lewis

Abstract

More than 55,000 people worldwide die from Cardiovascular Disease (CVD) each day. Calcification of the abdominal aorta is an established marker of asymptomatic CVD. It can be observed on scans taken for vertebral fracture assessment on Dual Energy X-ray Absorptiometry (DXA) machines. Assessment of Abdominal Aortic Calcification (AAC) and timely intervention may help to reinforce public health messages around CVD risk factors and improve disease management, reducing the global health burden related to CVDs. Our research addresses this problem by proposing a novel and reliable framework for automated “fine-grained” assessment of AAC. Inspired by vision-to-language models, our method performs sequential scoring of calcified lesions along the length of the abdominal aorta on DXA scans, mimicking the human scoring process.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_42

SharedIt: https://rdcu.be/cVRuq

Link to the code repository

https://github.com/NaehaSharif/Show-Attend-and-Detect

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a framework for automated assessment of AAC (Abdominal Aortic Calcification). A few existing methods predict an overall AAC-24 score; however, they have some shortcomings. To address them, an effective framework is proposed to generate fine-grained scores for images in a sequential manner. It utilizes an attention-based encoder-decoder network to mimic the human AAC-24 scoring method. According to the authors, this is the first time such a methodology has been used to address the AAC-24 scoring problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Cardiovascular Disease (CVD) is the main cause of death globally, and it also contributes significantly to disability. Vascular calcification occurs when calcium deposits in the arteries, and it can result in heart attacks or strokes. The abdominal aorta is one of the first vascular beds where calcification is seen. Since AAC develops well before clinical events, there is a chance to identify people at risk and intervene in a timely manner before they suffer cardiovascular events. The main strength of this paper is the use of an attention-based encoder-decoder network to mimic the human AAC-24 scoring process. The proposed methodology outperforms the 3 previous works found in the literature. Further, it can classify patients into the three risk categories (low, medium and high) with reasonable accuracy, sensitivity, and specificity. A positive aspect of this work is that it has achieved satisfactory outcomes on a small dataset of 1,916 scans. If a considerably larger dataset (with labels) can be found, the results could be justified better.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Nothing specific to note as weaknesses; however, some strategies for achieving better classification outcomes and improving the evaluation metrics (accuracy, sensitivity, and specificity) could have been suggested. The paper states that the AAC-24 scores are highly correlated with expert assessments, with an accuracy above 80%. How can this be improved further?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    None

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    A good piece of work which has some future work left.

    1. Applying the proposed algorithm on a larger dataset.
    2. Testing another option rather than using a pre-trained network.
    3. Exploring how the difference between expert assessment and the outcome of the proposed method can be reduced.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper has explored how automated assessment of AAC (Abdominal Aortic Calcification) could be done. The use of an attention-based encoder-decoder network is novel compared to the 3 related works highlighted in the paper. Further, it supports severity classification as well. Though the accuracy is not that high, the idea has the potential for future expansion.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors present a model for automatically assessing aortic calcification on DXA scans using an LSTM with attention. Their model has the benefit of providing individual scores for regions of the aorta, which is potentially useful for diagnosis as well as understanding model output. They compared against their implementation of a previous model and showed higher performance metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The approach of using an LSTM to treat the problem as a sequence is interesting and generally follows clinical practice. Having score breakdowns can be a major advantage for model explainability, which is necessary for eventual clinical implementation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    One significant weakness is that the results presented here are significantly lower than those in the previous publication of the state-of-the-art model they are comparing against, despite using the same model. The authors suggest that the previous publication may not have performed an appropriate evaluation, but the explanation here is lacking. It would be helpful to show a replication of the previous results and clearly demonstrate why those results are not a realistic representation of the model’s performance.

    The authors also weaken the justification of their sequential approach (the main innovation in their model) when they state that the individual scores are rigid and random. It seems that the authors were just trying to contrast their problem with language; since later descriptions suggest there is structure in the sequences, there does seem to be value in this approach. It would be good to describe how the structure of these scores can be modeled using a sequential approach and to stay consistent.

    Finally, there are several mistakes and inconsistencies in the results that make things a bit hard to follow or appreciate.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The evaluation uses a public dataset and the model is clearly described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    First, I’ll elaborate on my first point in the weaknesses. The authors note the reported results in the previous publication, but write them off by saying they “report their best results.” It is unclear what is meant by this statement; in most cases, we want to present our best results in a publication. If the authors are suggesting that the previous paper might have cherry-picked a beneficial test/train split or random seed, then it would be helpful for the authors to replicate this result (e.g., demonstrate the range of possible results using the previous publication’s methodology and show that the reported value is at the top of the range). This would demonstrate the limitations of the previous paper and show that the baseline model is implemented correctly.

    I’ll also list the issues/errors I have found with the evaluation here.

    1. The metrics reported in table 2 are computed one versus rest and then averaged, which can be deceptive in a three-class problem. The accuracies here are substantially lower than the actual three-class accuracy, which can be directly calculated from the confusion matrices in figure 3 (about 73% for the proposed model and 56% for the baseline).
    2. There aren’t any confidence intervals or statistical tests showing the stability of the results or the significance of any differences.
    3. The second example AAC-24 score in figure 1 is calculated incorrectly. The value should be 2, not 0.
    4. The confusion matrices in figure 3 are transposed. As displayed, the counts for the ground truths do not match the breakdown in the body.
    5. The metrics for the proposed model in table 2 do not match the confusion matrix in figure 3. Most are close, but do not match exactly (for example, the Moderate PPV should be 37.5 not 40).
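Point 1 can be checked directly against the confusion-matrix counts that appear later in the author feedback (the corrected matrices in the rebuttal; rows = ground truth, columns = predicted). A minimal illustrative sketch, not part of the original review:

```python
# Confusion matrices from the author feedback (rows = ground truth, columns = predicted).
ours = [[716,  85,  28],
        [201, 167,  77],
        [ 21, 106, 515]]
base = [[460, 308,  61],
        [132, 264,  49],
        [ 52, 238, 352]]

def three_class_accuracy(cm):
    """Plain multiclass accuracy: correctly classified / total."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total

def mean_ovr_accuracy(cm):
    """Average of the per-class one-vs-rest binary accuracies."""
    n = sum(sum(row) for row in cm)
    accs = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        tn = n - tp - fn - fp
        accs.append((tp + tn) / n)
    return sum(accs) / len(accs)

print(f"{100 * three_class_accuracy(ours):.1f}")  # 73.0  (proposed model)
print(f"{100 * three_class_accuracy(base):.1f}")  # 56.2  (baseline)
print(f"{100 * mean_ovr_accuracy(ours):.2f}")     # 81.98 (mean one-vs-rest)
```

For a three-class problem the mean one-vs-rest accuracy works out to (2a + 1)/3, where a is the plain three-class accuracy, so it always sits above a whenever a < 1; this is exactly the inflation the reviewer describes.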
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the way the problem is framed is interesting and producing individual scores can provide clinical value. The limitations in the evaluation as well as the discordance with the publication for the baseline model lower my enthusiasm.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #2

  • Please describe the contribution of the paper

    In this paper, a framework for automated fine-grained scoring of Abdominal Aortic Calcification using L1-L4 vertebral X-ray scans is described. The authors propose the use of a convolutional encoder to obtain a latent representation of the scans, and then train an attention mechanism to focus separately on the anterior and posterior segments of the abdominal aorta. The resulting output provides a fine-grained scoring, based on the AAC-24 scale, which is an improvement over previous efforts that only provided a global score for a given X-ray scan.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Improvement over previous methods, which only provide a global AAC-24 scoring classification per exam/image
    • Very thorough explanation of the clinical significance of the research
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper needs to be proofread again, as there are a lot of typos and mistakes
    • Results validation could be improved and the reported results are not very convincing
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors will publish the trained model and their code upon acceptance. However, some details should be disclosed in the paper, such as the number of training epochs. Also, a citation was provided for a work using the same dataset, but that work does not appear to be published.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Section 4: A few important details of the implementation were not explained, which hinders a full comprehension of the model and of the obtained results. For example:
      – how many training epochs were performed?
      – what was the threshold correlation value used in early stopping?
      – which layers used dropout regularization, and what was the dropout alpha?

    • Section 4.2: Data augmentation transformations should be applied carefully to medical images. Shear transformations are not always desirable, as they may distort spatial relations between anatomical structures. Also, the authors should clarify further the data augmentation ranges, e.g., “[-10, 10] pixels”

    • What was the reasoning behind the choice of ResNet152v2? Did the authors test other backbone architectures? The results analysis could be improved with a comparison of different CNN encoders.

    • The paper needs to be thoroughly proofread. Some minor issues include:
      – page 1, line 16: add a comma after “out of these”
      – page 2: define AAC-24 earlier in the text (e.g., after “Kaaupila 24-point scoring method”). There is a definition later in section 2.1.
      – section 2.1: no need for “rd” after 1/3 and 2/3. Also, if the authors find the space for it, please rewrite the third sentence in 2.1 using “more than”/“less than”.
      – section 2.1, 2nd paragraph: repeated “the” after “Furthermore,”
      – In Fig. 1b, there seems to be a typo. There is no calcification (as described in the text), but the left column of AAC-24 reports a 2 for L3 Ant, and the score is 0.
      – section 2.2: after [3] -> “followed”, and consider removing “as [4]”
      – section 2.2, line 4: add a comma after “as a whole”
      – the ROI acronym is not defined. Also, it can be used in section 4.2
      – near the end of section 2.2: “scores AAC-24” -> “AAC-24 scores”
      – section 3, line 3: “prepossessed” -> “pre-processed”
      – section 3, line 6: “conventional” -> “convolutional”
      – section 4, line 1: remove “of” after “comprises”
      – section 4.3, line 17: start the sentence with “In each fold,”
      – section 4.4, line 2: “as compared to” -> “, instead of a”
      – section 4.4, line 4: “sum up all” -> “sum of all” / remove the comma after “Since”
      – section 4.4, line 7: “comprises of” -> “consists of”
      – section 4.4, line 15: remove “one each”
      – “vertebra” is the singular form, “vertebrae” is the plural form

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is quite interesting, with a fair degree of novelty, and its application purpose is of great importance. However, some of the presented results are not convincing enough (e.g., correlations with human scores in table 3 are very low). Moreover, result validation could be a bit more extensive and the paper suffers from a lack of clarity at some points.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    This work presents a deep learning approach for grading abdominal aortic calcification on DXA scans in a spatially localized manner. The proposed network demonstrates good correlation with human scoring and compares favorably to an existing deep learning approach for global calcification scoring.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The clinical application is well suited for deep learning, as manual scoring of aortic calcification is time consuming.
    • Granular scoring near different lumbar regions is carried out, rather than generating only global calcification scores.
    • Automated granular scoring with the proposed method demonstrates strong correlation with human scoring.
    • The use of figures for illustrating important points is well done considering space limitations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The mathematical framework presented in Section 3 is difficult to follow.
    • The comparison to an existing deep learning method is interesting and relevant, though not ideal since modifications to the existing method were made.
    • The intuition of applying principles from the vision-to-language domain is somewhat unclear given the differences between the clinical application and language processing.
    • There is a small dataset given the granularity of the task.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Okay

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please refer to the strengths and weaknesses above. It would help to provide additional information on how stratified cross-validation was carried out.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is an interesting clinical application with a reasonable evaluation and promising results. However, the explanation of the proposed network architecture could be improved.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Strengths: Relevant clinical problem, and a solution that is both novel from a technical perspective and relevant from a clinical perspective; produces individual scores and interpretability; good use of figures.

    Weaknesses: The explanation for the results being lower than in the previous publication is unclear/weak; the comparison is also not ideal since modifications to the existing method were made; there are inconsistencies in the results (R3); the explanation of the architecture could be improved.

    In the rebuttal, the authors should mainly focus on clarifying the results and the comparison with the baseline method.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

We are grateful to the reviewers for acknowledging the clinical relevance of our problem and the novelty of our solution.

R3 & R4 commented on the comparison with [15] (R3 suggests we imply their results have been “cherry-picked”). What we meant was that the authors of [15] did not perform cross-validation, but rather reported their results from a single train/validate/test run and selected the network (out of a battery of CNN models) that gave them the best results ([15], page 3, first two paragraphs). We also know (from private communication with the authors of [15]) that “kfold validation wasn’t used” and that “the test results are slightly higher because the test data had a slightly higher number of high-rated scores.” We will rephrase our statement in Sect. 2.2 to reflect the above and apologise that this was not done in the original manuscript. Note: their mean accuracy was 70.7 ± 3.2% while ours is 81.98 ± 2.5%. The updated results with standard deviations will be mentioned in the results section. Furthermore (to R4), we inevitably differ from their network, as the authors of [15] did not share the critical information required to exactly replicate their results. Further, [15] was limited to generating a single global score from an image, while we learn granular as well as global scoring, which is of clinical value and also adds explainability (a highly desirable feature in the medical domain).

R3 (comment 4&5): We sincerely apologize and acknowledge that there was some error in calculating the metrics for our model in Table-2, which resulted in under-reporting our results. The correct (transposed) confusion matrices and the corresponding results (one-vs-rest) are below. Note that the correlation was not affected by this. We will replace them in the camera-ready version.

BASELINE (rows = ground truth, columns = predicted):

             low    med    high
    low      460    308      61
    med      132    264      49
    high      52    238     352

OURS (rows = ground truth, columns = predicted):

             low    med    high
    low      716     85      28
    med      201    167      77
    high      21    106     515

              LOW                MED                HIGH               Mean
         M(base)   OURS     M(base)   OURS     M(base)   OURS     M(base)   OURS
  Acc      71.14  82.52       62.06  75.52       79.12  87.89       70.77  81.98
  Sens     55.49  86.37       59.33  37.53       54.83  80.22       56.55  68.04
  Spec     83.07  79.58       62.88  87.02       91.37  91.76       79.11  86.12
  NPV      70.99  88.45       83.63  82.16       80.06  90.20       78.23  86.93
  PPV      71.43  76.33       32.59  46.65       76.19  83.06       60.07  68.68
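The per-class one-vs-rest metrics above can be reproduced from the “OURS” confusion matrix. A minimal sketch, for illustration only (rows = ground truth, columns = predicted; class order low, moderate, high):

```python
# "OURS" confusion matrix from the rebuttal (rows = ground truth, columns = predicted).
cm = [[716,  85,  28],
      [201, 167,  77],
      [ 21, 106, 515]]

def ovr_metrics(cm, k):
    """One-vs-rest sensitivity, specificity, PPV and NPV (in %) for class index k."""
    n = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # missed cases of class k
    fp = sum(row[k] for row in cm) - tp   # other classes predicted as k
    tn = n - tp - fn - fp
    return {"sens": 100 * tp / (tp + fn),
            "spec": 100 * tn / (tn + fp),
            "ppv":  100 * tp / (tp + fp),
            "npv":  100 * tn / (tn + fn)}

low = ovr_metrics(cm, 0)
med = ovr_metrics(cm, 1)
print(f"{low['sens']:.2f} {low['spec']:.2f} {low['ppv']:.2f} {low['npv']:.2f}")
# 86.37 79.58 76.33 88.45 — the LOW column for OURS
print(f"{med['ppv']:.2f}")  # 46.65 — the corrected Moderate PPV
```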

R3 also pointed out the “one-vs-rest and 3-class accuracies/metrics” and the lack of confidence intervals. As the reviewer has mentioned, both accuracies can be calculated directly from the confusion matrices. We opted for one-vs-rest to make it easy to compare (fairly) with [15]. However, we will now report both one-vs-rest and 3-class accuracies in the results section. Our 3-class accuracy is 72.8 ± 2.9% while that of the baseline is 55.8 ± 3.2%. The Pearson and Kendall correlations (p << 0.001) show statistical significance. We will add these to the results. R2 asked for some implementation and augmentation details: We trained the networks until we achieved the highest correlation and then for 50 more epochs (average epochs = 100). The first dropout was applied after the hidden layer of the LSTM (alpha = 0.5), and another (alpha = 0.4) before the last FC layer. We agree that shear transformations are not always desirable. However, to accommodate different spine structures in training we applied a very small shear transformation of [0.01, 0.05] degrees. Translation and scaling were specified as multiplication factors of width and height, while shear and rotation were in degrees. This information will be added to the paper. Thank you for pointing out the language errors, which have been fixed.

R3 also asked about the intuition of using vision-to-language models. We know that the growth patterns of AAC follow a sequential model [12] and humans read these scores sequentially. Hence, we found it intuitive to use a sequential language model to learn the “language of AAC”.

We are grateful to R1 for their suggestions.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main reviewer points (clarifying the results and the comparison with the baseline method) are well-addressed.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    There is a 50/50 split on this. Unfortunately some of the errors in the original paper might have damaged the paper. I like the formulation of the method in the paper and find it novel. I am inclined to accept.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper may be the first work to show how automated assessment of AAC (Abdominal Aortic Calcification) could be done, adapting sequential attention-based models from the vision-language domain to address the challenge of fine-grained AAC-24 scoring. This is clinically meaningful and important work. The technical novelty is acceptable and moderate. Experimental results are sufficient and promising, using a dataset of 1,916 low-resolution DXA scans.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6


