
Authors

Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

Abstract

This paper proposes a novel transformer-based model architecture to solve medical imaging problems involving analysis of vertebrae. It also considers two applications of such models: (a) detection of spinal metastases and the related conditions of vertebral fractures and cord compression, and (b) radiological grading of common degenerative changes in intervertebral disks. Our contributions are as follows: (i) We propose Spinal Context Transformer (SCT), a deep-learning architecture suited for the analysis of repeated anatomical structures in medical imaging such as vertebral bodies (VBs). Unlike previous methods, SCT considers all VBs as shown in all available image modalities together, making predictions for each based on the context from the rest of the spinal column. (ii) We apply the architecture to a novel but important task - detecting spinal metastases and related conditions of cord compression and vertebral fractures/collapse from multi-series spinal MR studies. This is done using annotations extracted from free-text radiological reports as opposed to bespoke annotation. However, the model shows strong agreement with vertebral-level bespoke annotations from a radiologist on the test set. (iii) We also apply SCT to an existing problem - radiological grading of inter-vertebral discs (IVDs) in lumbar MR scans for common degenerative changes. We show that by considering the context of vertebral bodies in the image, SCT improves the accuracy for several gradings compared to previously published models.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_26

SharedIt: https://rdcu.be/cVRtc

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #3

  • Please describe the contribution of the paper

    This paper presents the Spinal Context Transformer (SCT), a deep learning architecture considering context from multiple sequences and neighboring vertebrae, for spinal cancer detection and radiological grading. In this study, the training labels for the spinal cancer detection task were obtained from free-text radiological reports, avoiding the necessity of annotating the images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This study considers context from multiple sequences and neighboring vertebrae for spinal cancer detection and radiological grading, which is interesting.

    2. This study demonstrates the potential of obtaining supervision from free-text radiological reports.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The structure of the attention mechanism (in Fig. 1b) was not described in the paper, therefore it is unknown how the attention is achieved.

    2. More information about the baseline models has to be provided to evaluate the fairness of the comparison. For example, do they operate on multiple sequences? In what respects are they good choices of baseline? In Table 3, instead of SpineNet (T2) and Baseline: SCT Encoder (T2), it would be fairer to compare against SpineNet (T1, T2) and Baseline: SCT Encoder (T1, T2).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset for spinal cancer detection is private. No experimental codes are provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Please describe the structure of the attention mechanism and explain how the attention is achieved.

    2. Please consider justifying or changing the baseline models.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of this paper is the design of SCT and the use of information from free-text radiological reports as supervision, both of which are interesting. The main problem is the baseline models, which might be unfair when compared with the proposed approach.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper describes a method for vertebrae analysis using an algorithm relying on transformers. The authors focus in particular on vertebrae classification using multiple MRI sequences. Particular attention is drawn to weak supervision, as the labels are extracted from clinical reports. The authors perform the evaluation on the Genodisc dataset and compare the method to some state-of-the-art works. Moreover, some ablation is performed regarding the contribution of the sequences being used.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper focuses on a clinically relevant task and does so in a technically appealing context, that of using transformers with weak supervision, making it of interest to the medical imaging community overall. The proposed evaluation is sufficiently broad to allow for comprehension and judgement. Moreover, the paper is nice to read as it is quite clear and well exposed, so the reading flows. The provided illustrations contribute to the understanding of the paper.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors introduce a transformer-based algorithm, which is of rising interest in the community. However, there is little discussion of the role the transformers are playing in the performance increase, which might lead to uninformed conclusions.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors provide some reasonable amount of details on the dataset definition and its preparation. They also provide details on the training setup. Some pseudo-code is provided in the supplementary material. The encoder is introduced as ResNet18.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Method

    • In Fig. 1 the authors introduce STIR and FLAIR sequences, while later, in the experiments, they talk about T1 and T2. Could the authors revise this for consistency?

    Experimental results

    • At this stage, it might be useful to remind the reader of the encoder being used (ResNet18, as stated in Section 3) and the way it was trained.
    • Results: it appears that the SCT (T1, T2) has lower performance than with T1 or T2 alone. Could the authors comment on that?
    • Results: In Table 2 the authors show the performance compared to the expert and report annotations with slightly different trends (e.g., the baseline outperforms the proposed method in fracture classification). Could the authors provide more details on how the tables should be read and how the results could be interpreted?

    Discussion

    • I wonder, whether the T1+T2 trained method would require both sequences available at test time which might be a limitation of the method? Could the authors comment on that?

    Overall:

    • I would suggest revising the format and better use of subsections and paragraphs, as some of the sections appear to be a bit lengthy.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and exposes the proposed approach well. The topic could be of interest to the general public as it could prompt discussion of transformers as a promising tool. There might be a few minor improvements that could be made before moving forward (see detailed comments).

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #5

  • Please describe the contribution of the paper

    The paper proposes a Spinal Context Transformer (SCT) for a variety of spine-related tasks in multi-series spinal MR scans. It also proposes strategies to use annotations derived from reports. Experiments on two datasets show improved accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes to use transformer layers and an attention mechanism to fuse features from multiple MR slices, series and vertebrae, which is intuitive and effective. The use of labels extracted from reports is also economical.
    2. Comprehensive experiments are done on multiple spinal tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method is only compared with one existing method, published in 2017. There should be more comparisons with existing methods, as well as more ablation studies to assess each component of the method. Besides, in Table 3, SCT improved on SpineNet by only 0.7% in average accuracy with T2 images.
    2. The content is somewhat too much for this 8-page paper, so there is little space for more comparison and ablation studies. I think the introduction can be trimmed a bit.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The description of the method is satisfactory but not perfect, possibly because the algorithm has many parts but the space is limited.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Please clarify if the method needs manual annotation of vertebra levels.
    2. What is balanced accuracy in Table 3?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Reasonable and novel method with comprehensive experiments.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper received three consistent reviews. Given all conditions as they stand, this paper is recommended for Provisional Accept. Please adequately address the reviewers’ constructive comments and suggestions:

    “The novelty of this paper is the design of SCT and the use of information from free-text radiological reports as supervision, both of which are interesting. The main problem is the baseline models, which might be unfair when compared with the proposed approach.”

    “Method In Fig. 1 the authors introduce STIR and FLAIR sequences, while later, in experiments, they talk about T1 and T2. Could the authors revise it for consistency?

    Experimental results

    • At this stage, it might be useful to remind the reader of the encoder being used (ResNet18, as stated in Section 3) and the way it was trained.
    • Results: it appears that the SCT (T1, T2) has lower performances than T1 or T2 alone. Could the authors comment on that?
    • Results: In table 2 the authors show the performances compared to the expert and report annotations with slightly different trends (e.g., the baseline outperforms the proposed method in fractures classification). Could the authors provide more details on how the tables should be read and how the results could be interpreted?

    Discussion

    • I wonder, whether the T1+T2 trained method would require both sequences available at test time which might be a limitation of the method? Could the authors comment on that?

    Overall:

    • I would suggest revising the format and better use of subsections and paragraphs, as some of the sections appear to be a bit lengthy.”
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6




Author Feedback

Thank you to all reviewers and the meta-reviewer for their constructive comments. The feedback will improve the paper’s quality. We start by addressing points raised by several reviewers, then discuss individual concerns.

Firstly, reviewers expressed confusion about the baselines used in the radiological grading application: SpineNet and the SCT encoder. SpineNet (Jamaludin et al. 2017) is an existing method for grading using T2 sequences only. As far as we are aware, these were the most recent published results on the Genodisc dataset at the time of submission. However, a technical paper has since been released with a new version, SpineNetV2 (Windsor et al. 2022). These new results will be included in the camera-ready - note that SCT still outperforms SpineNetV2. Directly comparing to other research on automated grading (e.g. DeepSpine, Lu et al. 2018) is challenging as we do not have access to the datasets/models used.

The SCT Encoder baseline is the model shown in Figure 1b) trained independently. This model operates on a single vertebra from a single sequence. Adapting it for multiple sequences is non-trivial; concatenating embeddings from each sequence or treating each as a separate input channel would likely improve performance on Genodisc. However, this would not adapt to other datasets where subjects do not have the exact same sequences (e.g. subject 1 has T1 & T2, subject 2 has T2 alone, subject 3 has T1 & STIR, etc.). A major strength of SCT is that it deals with such datasets naturally. We will make this clearer in the final draft, and will add to Table 3 a baseline model which averages predictions from the T1 and T2 sequences together.

On a related note, we agree with R#4 that Figure 1a) is confusing since it shows STIR and FLAIR sequences. The reason for this is the spinal cancer dataset has a variety of different sequences for each subject as already discussed (in this case STIR & FLAIR) whereas Genodisc strictly has one T1 image and one T2 image for each patient. We apologise for not making this clearer; we will clarify this in the Figure’s caption and also in the main text.

Individual Concerns/Comments:

R#3: Attention Mechanism Structure: The attention pooling mechanism feeds the embedding vector for each sequence element into a simple 2-layer MLP, which outputs a scalar attention score for each element. A softmax is then calculated across all scores from all elements, and the resulting values are used to calculate a weighted average of the sequence. Examples of such attention scores can be seen in Figure 2 of the supplementary. For the transformer layers, a standard attention mechanism is used, as outlined in Vaswani et al. 2017. We will clarify this in the final paper.

R#4: T1&T2 performance vs. single sequence: In some grading subtasks, SCT using T1 & T2 does slightly worse than SCT using T1 or T2 alone. In such cases, the performance differences are very small (<=1.1%) and likely explained by three factors: (1) Some grading schemes use information from a single sequence, so we should not expect that adding additional sequences will always improve performance; we have done some preliminary experiments using 2-channel images which support this. (2) Label noise: there is a degree of subjectivity to many grading schemes and inter-reader agreement is not perfect, so tiny differences in agreement with expert annotations do not necessarily indicate a better-performing model. (3) There are multiple subtasks, and optimising a model to perform well across all of them is challenging. However, note that the average performance of the T1&T2 model exceeds that of the single-sequence models. Test-time sequences: SCT does not require T1 & T2 or any other specific set of sequences at test time, provided they appear in the training data.

R#5: Level annotation: We use levels detected by an existing automated method. We will clarify this in the camera-ready. Balanced accuracy: for each subtask, this is the average recall across all classes.
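The attention pooling and balanced-accuracy metric described in the rebuttal can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the MLP hidden size, ReLU activation, and weight shapes are assumptions for demonstration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(embeddings, w1, b1, w2, b2):
    """Attention pooling as described in the rebuttal: a 2-layer MLP maps each
    element's embedding to a scalar score; a softmax across all scores gives
    weights for a weighted average of the embeddings.

    embeddings: (n, d) array, one row per sequence element (e.g. per slice).
    w1, b1, w2, b2: MLP parameters (hidden size is an illustrative choice).
    Returns the pooled (d,) embedding.
    """
    hidden = np.maximum(0.0, embeddings @ w1 + b1)  # (n, h), ReLU assumed
    scores = (hidden @ w2 + b2).ravel()             # (n,) scalar score per element
    weights = softmax(scores)                       # (n,) attention weights, sum to 1
    return weights @ embeddings                     # (d,) weighted average


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy as defined in the rebuttal: average recall over classes."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```

For example, with predictions [0, 0, 1, 0] against labels [0, 0, 1, 1], class 0 has recall 1.0 and class 1 has recall 0.5, giving a balanced accuracy of 0.75 even though plain accuracy is also 0.75 here; the two diverge when classes are imbalanced.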


