
Authors

Kimberly Amador, Anthony Winder, Jens Fiehler, Matthias Wilms, Nils D. Forkert

Abstract

Predicting the follow-up infarct lesion from baseline spatio-temporal (4D) Computed Tomography Perfusion (CTP) imaging is essential for the diagnosis and management of acute ischemic stroke (AIS) patients. However, due to their noisy appearance and high dimensionality, it has been technically challenging to directly use 4D CTP images for this task. Thus, CTP datasets are usually post-processed to generate parameter maps that describe the perfusion situation. Existing deep learning-based methods mainly utilize these maps to make lesion outcome predictions, which may only provide a limited understanding of the spatio-temporal details available in the raw 4D CTP. While a few efforts have been made to incorporate raw 4D CTP data, a more effective spatio-temporal integration strategy is still needed. Inspired by the success of Transformer models in medical image analysis, this paper presents a novel hybrid CNN-Transformer framework that directly maps 4D CTP datasets to stroke lesion outcome predictions. This hybrid prediction strategy enables an efficient modeling of spatio-temporal information, eliminating the need for post-processing steps and hence increasing the robustness of the method. Experiments on a multicenter CTP dataset of 45 AIS patients demonstrate the superiority of the proposed method over the state-of-the-art. Code is available on GitHub.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_62

SharedIt: https://rdcu.be/cVRuN

Link to the code repository

https://github.com/kimberly-amador/Spatiotemporal-CNN-Transformer

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    In this paper, the authors propose to predict acute ischemic stroke (AIS) lesions 2-7 days after the stroke onset from Spatio-temporal Computed Tomography Perfusion (CTP). Instead of using estimated perfusion maps, the authors utilize the raw temporal CTP acquisitions, by proposing a new hybrid Convolutional Neural Network (CNN) and Transformer model. CNN encoders extract features from each time step, then the Transformer learns the temporal relations, and finally, a CNN decoder estimates the final lesion. The method is evaluated in an in-house dataset and improves over perfusion maps-based CNN and CNN-based temporal methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The text is well-written and easy to follow. In particular, this Reviewer appreciated the Introduction, with a literature overview and motivations.
    • Hybrid CNN-Transformer models [1] were previously proposed, but, to the best of the knowledge of the Reviewer, this particular model and its application to CTP is novel.
    • The use of Transformers for this purpose, as proposed by the authors, seems capable of leveraging the best of these models, i.e., learning temporal relationships in CTP.
    • The results are positive, improving over CNN-based naive baselines with perfusion maps, or even considering the temporal component.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors propose to have a CNN-based encoder to process each of the input time steps of the CTP. This comes with a computational load increase that is linear with the number of steps. Therefore, it poses limitations in terms of application, but also on the capacity of the CNN encoder.
    • A major weakness is the in-house dataset used to evaluate the proposed method. While the authors will release the code, it is hard for other methods to compare against the proposed work.
    • The evaluation raises some concerns. 1) While 10-fold cross-validation can be a good choice, the reported results are presumably averaged over the validation folds, which may be overly optimistic. In other words, a separate test set is missing. 2) The authors crop the images such that only the brain hemisphere with the stroke is kept, which may lead to overly optimistic results (artificially reducing False Positives) and is impractical in a real setting, where no annotations are available to determine the hemisphere.
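
The first evaluation concern above could be addressed by holding out a test set at the patient level before running cross-validation. The following is a minimal sketch of that idea, not the authors' protocol: the helper name `split_patients`, the 20% test fraction, and the seed are assumptions made for illustration.

```python
import random

def split_patients(patient_ids, n_folds=10, test_fraction=0.2, seed=0):
    """Hold out a test set, then build k folds from the remaining patients.

    Hypothetical helper: the fraction, fold count, and seed are illustrative.
    """
    ids = sorted(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)

    n_test = max(1, round(len(ids) * test_fraction))
    test_ids, dev_ids = ids[:n_test], ids[n_test:]

    # Round-robin assignment keeps fold sizes within one patient of each other.
    folds = [dev_ids[i::n_folds] for i in range(n_folds)]

    splits = []
    for k in range(n_folds):
        val_ids = folds[k]
        train_ids = [p for j, fold in enumerate(folds) if j != k for p in fold]
        splits.append((train_ids, val_ids))
    return splits, test_ids

# 45 patients, matching the dataset size stated in the abstract.
splits, test_ids = split_patients(range(45))
```

Under this scheme, hyper-parameters would be chosen on the validation folds only, and the held-out patients would be evaluated once at the end.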
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is well described, and the authors promise to release the code. Therefore, implementation-wise, the proposed work should be reproducible. The main concern is related to the dataset, which is in-house. In this sense, it is not possible to reproduce the final results. The Reviewer wonders if it is possible to release at least a small set of images, together with results and manual annotations.

    In summary, it should be possible to implement the proposed method adequately, but the final results cannot be reproduced.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    General

    In this paper, the authors tackle AIS lesion prediction directly from CTP raw images, i.e., without the conventional way of estimating perfusion maps. This is an important topic since such maps can discard information that may be learnable from a data-driven approach. The novelty of the work lies in the methodological approach, by using CNN and Transformers to extract and combine features from the multiple time-steps, and its application to CTP and AIS lesion prediction. The Reviewer considers the method novel, but there are concerns related to the evaluation. Please, check the detailed comments below.

    Comments/questions to the authors (not in order of importance)

    1) How many transformer blocks are used in the model? Could the authors clarify it, please? Also, it would be interesting to see ablation studies where this number is changed. That would also inform about how much we can gain from further temporal context aggregation.

    2) The proposed method is evaluated on an in-house dataset, which makes it hard for future work to build upon it or to compare results. There are public datasets that could be used. For example, ISLES 2017 [2] provides MRI perfusion data with the raw temporal acquisitions, and it could be used to compare with SotA. Although it is not CTP, it should be possible to apply the proposed method to ISLES 2017 without modifications. Could the authors comment on why such a dataset was not used, please? a) Moreover, ISLES 2017 has a hidden test set, which would help mitigate some of the evaluation concerns mentioned next. b) The authors state “sets a new state-of-the-art for lesion outcome prediction from raw 4D CTP data”. But, since the data is not publicly available, such a claim cannot be challenged, and it should not be made so strongly.

    3) The proposed work raises some concerns about the evaluation procedure. a) The authors use 10-fold cross-validation for evaluating the model. This is a valid practice, given the size of the dataset. However, the Reviewer assumes that the reported metrics are the average over the validation fold of each run. In this case, results may be overly optimistic, since there is a risk of overfitting. Again, this is a valid method to find hyper-parameters, but having an external test set would be desirable. b) The authors crop the images such that only the brain hemisphere with the stroke is kept, which poses two main issues. First, results may be overly optimistic because some potential regions of False Positive detections are removed before analysis. Second, it does not represent a fully automatic real-world scenario, where the affected hemisphere of the brain is not known since manual annotations do not exist. c) The authors detect a strong statistical difference in the results in terms of DSC when comparing with baselines. However, little to no difference exists when volume is considered. This is of course possible, but it would be interesting to see more discussion on why this is the case.

    4) The authors cite [3], but do not compare against it. This work was proposed in the context of MRI perfusion but seems applicable to CTP, at least partially. One of the contributions of [3] is to model the temporal aspects of the raw perfusion images as channels of a CNN. This would be a simple, yet valid, CNN-only model to compare against.
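
Comment 3c can be illustrated with a toy example: two predictions with the same lesion volume can overlap very differently with the reference lesion, so volume agreement and DSC need not move together. The masks below are invented for illustration and are unrelated to the paper's data; voxels are represented as integer indices for simplicity.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def volume_error(pred, truth):
    """Absolute volume difference in voxels."""
    return abs(len(pred) - len(truth))

truth = set(range(0, 100))      # 100-voxel reference lesion
pred_a = set(range(10, 110))    # same volume, shifted by 10 voxels
pred_b = set(range(60, 160))    # same volume, shifted by 60 voxels

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(name, "DSC =", dice(pred, truth),
          "volume error =", volume_error(pred, truth))
```

Both predictions have zero volume error, yet prediction A overlaps far more with the reference than prediction B, which is one possible reason why a method can differ significantly in DSC while differing little in volume.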

    Further comments (suggestions/extra comments on future work) - NOT intended to be addressed during rebuttal

    • The method considers 2D slices of the CTP, which neglects the 3D nature of the images. It is understandable, however, that it is computationally demanding. Indeed, the authors will try to address it in future work. Besides the proposed future direction, it may be worth taking a look at methods that decrease the computational load of the 3D CNN, too, such as [4].

    • This is more of a detail. The authors wrote, “paired t-test with p<0.05”. The p-value is indeed used to check the statistical significance, but such a threshold is referred to as significance level, usually defined as alpha.

    • The Reviewer believes that the proposed method is interesting. Still, it would be good to see an extended version of the work that considers more validation, more datasets (including public ones), and a comparison with more SotA methods.
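
The distinction between the p-value and the significance level mentioned above can be made concrete with a small sketch of a paired t-test. The per-fold DSC values are invented for illustration only and are not results from the paper. The function name `paired_t_statistic` is hypothetical, and the test is compared against the tabulated two-sided critical value for 9 degrees of freedom rather than computing a p-value, which would need a t-distribution CDF.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Invented per-fold DSC values, for illustration only.
method_a = [0.45, 0.50, 0.42, 0.48, 0.51, 0.44, 0.47, 0.49, 0.46, 0.43]
method_b = [0.40, 0.44, 0.41, 0.42, 0.45, 0.40, 0.43, 0.42, 0.41, 0.40]

ALPHA = 0.05          # significance level, fixed before running the test
T_CRITICAL = 2.262    # two-sided critical value for alpha = 0.05, df = 9

t, df = paired_t_statistic(method_a, method_b)
significant = abs(t) > T_CRITICAL
```

Here `ALPHA` is the threshold chosen in advance, while the p-value is the quantity computed from the data and compared against it. A statistics library such as SciPy (`scipy.stats.ttest_rel`) would return the p-value directly.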

    References

    [1] Carion, Nicolas, et al. “End-to-end object detection with transformers.” European conference on computer vision. Springer, Cham, 2020.
    [2] http://www.isles-challenge.org/ISLES2017/
    [3] Pinto, Adriano, et al. “Enhancing clinical MRI perfusion maps with data-driven maps of complementary nature for lesion outcome prediction.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2018.
    [4] He, Junjun, et al. “Group Shift Pointwise Convolution for Volumetric Medical Image Segmentation.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2021.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed work has technical novelty. The hybrid CNN-Transformer and application to AIS from CTP raw images are novelties. However, the evaluation raises concerns: 1) the data is not public, 2) results may be overly optimistic due to 10-fold cross-validation, 3) the authors crop the brain hemisphere of interest before analysis.

    Still, the MICCAI community could benefit from this work.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Thank you for the rebuttal and addressing some concerns.

    The authors address or clarify some of the concerns, such as comparison with other methods.

    Some limitations, such as cropping the affected hemisphere or the lack of details in the evaluation, are addressed by the authors, but the responses are not fully convincing. Still, the authors promise to add some discussion to the final manuscript.

    Therefore, this Reviewer maintains the same score as before - accept.



Review #3

  • Please describe the contribution of the paper

    The authors propose to use a spatio-temporal Transformer to predict stroke lesion outcomes directly from 4D CTP images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The usage of transformer on this specific problem can be considered novel. The manuscript is well written.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I found no major weaknesses.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors state that code will be released. The in-house data is unfortunately not available; it might be difficult to reproduce the presented results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • For a fair comparison between methods, I believe each method should choose its own optimal training settings, e.g., epochs, learning rate, batch size, rather than having them fixed for all methods as presented in this manuscript.
    • When it comes to spatio-temporal learning, commonly used approaches include the well-known LSTM/GRU and, nowadays, ConvLSTM and ConvGRU. I wonder why the authors opted for a Transformer over ConvLSTM/ConvGRU in the first place?
    • As the Transformer is the main technical contribution, I am curious how it would compare with ConvLSTM/ConvGRU, which is not presented in this paper.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The usage of a Transformer in this application is somewhat novel and yields better results, although I still suggest justifying the choice of a Transformer over ConvLSTM and ConvGRU.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    Although I keep my previous rating, I encourage the authors to present a comparison of the Transformer-based method with ConvLSTM/ConvGRU, even though in my view there is a large chance that they perform equally well.



Review #4

  • Please describe the contribution of the paper

    This paper proposes to segment ischemic stroke lesions directly from 4D CT perfusion images, rather than from perfusion parameter maps. The authors use a Transformer to fuse the temporal information and a CNN to extract spatial features.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1, Using 3D CT perfusion images for segmentation has not been well studied in the field. This topic is fairly new to the community. 2, The paper is smoothly written and easy to understand. 3, The authors successfully showed that the proposed method outperformed segmentation from perfusion parameter maps.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1, The novelty of the method seems to be limited. The structures of the encoder and decoder are based on existing works, and the authors also directly apply the Transformer to the outputs of the encoder. This makes me feel that the method is a simple combination of these existing methods. 2, The experiments are not sufficient. The authors only compared their method with TCN and U-Net. 3, Some important details of the method are not provided. The authors claimed that “full details of the proposed architecture can be found in the supplementary material”. However, I only see comparisons of model sizes and runtime in the supplementary material.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors missed some important details of the method, which hinders the reproducibility of the paper. However, they promised to release the code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1, Some details should be provided: after the CNN encoder, what is the feature map size for each slice, and how are the feature maps sent to the Transformer and decoder? Let’s say there are T slices (at T time points) that lead to T feature maps; the output of the Transformer will then consist of T feature maps as well. How are the T feature maps obtained by the Transformer combined in the decoder to obtain a single segmentation? 2, I think the comparison of the model sizes and average runtime does not take much space and can be put into the manuscript. 3, The authors should consider comparing the fusion strategy with some other methods. The Transformer here is proposed for fusing the information from the T slices. Are there any alternatives for this purpose? Some comparisons are necessary. 4, In Figure 3, it seems that the images have been skull-stripped and cropped. However, such preprocessing was not described in the manuscript. 5, Is it really necessary to use the raw 4D CT perfusion images as input? What happens if a Transformer-based network is used on the perfusion parameter maps?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would like to reject this paper due to its limited novelty of the method and insufficient experiments.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    6

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors’ rebuttal highlighted the method’s novelty. Although the authors used general Transformer structures without many modifications in the encoder and decoder, the method is of interest for the specific application. However, the concerns about the details/reproducibility of the method and the experiments are not well addressed in the rebuttal. Therefore, I would change my score from 3 (reject) to 4 (weak reject).




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work addresses an interesting problem and the application is new. However, there are some concerns regarding the methodological novelty and the fairness of the comparisons. The work is thus invited for rebuttal to address the major concerns that the reviewers have raised, particularly on the technical novelty, the justification of the choice of Transformer and compared models, as well as the particular experimental settings used.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We thank all reviewers and the AC for their generally very positive feedback on our novel method for predicting (follow-up) ischemic stroke lesion outcomes from (baseline) 4D CTP imaging. Here, we focus on clarifying aspects mentioned by the AC and reviewers about the (1) technical novelty, (2) rationale of the Transformer and comparison models, and (3) experimental setup.

Our end-to-end trainable method combines elements of CNNs and Transformer to map a 4D CTP sequence to a single output segmentation (lesion outcome prediction). This novel approach allows us to learn rich spatio-temporal features directly from the 4D data. To do this, all frames are efficiently and simultaneously analyzed using a shared encoder. The resulting latent representations are used as input tokens to a Transformer to learn the complex temporal perfusion dynamics. The Transformer output is then aggregated via global max pooling and sent to a single decoder to make a prediction.
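
The data flow described in the paragraph above (shared encoder per frame, attention over the resulting tokens, global max pooling over time, single decoder) can be sketched with a toy example. This is a simplified illustration under stated assumptions, not the authors' implementation: the encoder and decoder are trivial stand-ins for the CNNs, tokens are 2-element vectors rather than feature maps, and the attention is single-head with identity projections and no learned weights. All function names are hypothetical.

```python
import math

def shared_encoder(frame):
    """Stand-in for the shared CNN encoder: maps one 2D frame to a token.

    Hypothetical: the 'features' are just the mean and max intensity.
    """
    flat = [v for row in frame for v in row]
    return [sum(flat) / len(flat), max(flat)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)   # each time point attends to all time points
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

def global_max_pool(tokens):
    """Collapse T tokens into one via the element-wise max over time."""
    return [max(col) for col in zip(*tokens)]

def predict(ctp_sequence, decoder):
    tokens = [shared_encoder(frame) for frame in ctp_sequence]  # T tokens
    fused = global_max_pool(self_attention(tokens))             # 1 token
    return decoder(fused)                                       # 1 output

# Toy sequence: T = 4 frames of 2x2 pixels with rising and falling intensity.
sequence = [[[0.0, 0.1], [0.1, 0.0]],
            [[0.5, 0.6], [0.4, 0.5]],
            [[0.9, 1.0], [0.8, 0.9]],
            [[0.3, 0.2], [0.2, 0.3]]]
output = predict(sequence, decoder=lambda z: z)  # identity stand-in decoder
```

The point of the sketch is the shape of the computation: T frames become T tokens, the attention step keeps T tokens, and the pooling step reduces them to a single representation regardless of T, so one decoder pass yields one prediction.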

R4 mentions that our work makes use of existing methods. However, we would like to emphasize that: (1) Such an efficient spatio-temporal CNN/Transformer combination usable on high-dimensional 4D CTP data has never been proposed before. (2) Our quantitative results show that this combination outperforms a 3D CNN trained on perfusion maps and a spatio-temporal 4D CTP model recently published at MIDL2021, which we consider as the current state-of-the-art. Thus, and as highlighted by R2&R3, the novelty of our work lies in our particular model and its specific application to 4D CTP and stroke lesion outcome prediction.

Our motivation for using Transformer stems from its ability to learn complex short- and long-range relations within a sequence. Its inherent attention mechanism enables the model to simultaneously attend to all time points rather than just focus on neighbouring ones. This is a major advantage over the state-of-the-art method (MIDL2021), which uses a more restricted, less powerful, tree-like temporal convolutional network (TCN) as a temporal fusion strategy. Our results quantitatively confirm this theoretical superiority of the Transformer setup: our approach outperforms the TCN-based method from MIDL2021 by a large margin. While a ConvLSTM/ConvGRU (mentioned by R3) could be used for comparison, we refrained from doing so as there are no published methods for stroke lesion prediction using them. Besides, in other scenarios, TCNs have previously outperformed these methods in terms of performance and inference time. We also refrained from using the method proposed by Pinto et al. (mentioned by R2) in our evaluation as it is technically not fully comparable to ours (combines 4D data with perfusion maps) and is optimized for multi-modal MRI data, not 4D CTP. We believe these aspects make it hard to fairly compare both methods in a space-restricted MICCAI paper. While using the ISLES2017 MRI dataset would have been technically possible, we focused on using 4D CTP as this imaging modality remains largely unexplored for this application despite being clinically more widely available. In the future, we plan to optimize our method also for multi-modal MRI datasets.

We would like to clarify that even at the early stages of a stroke, information about the affected brain hemisphere is clinically almost always available (even without imaging). We believe this justifies the ipsilateral data cropping (mentioned by R2&R4) in this initial study, which was primarily done to reduce the model complexity. We will add a discussion of this potential limitation to the final version of the paper. Additional modifications to the paper will include clarifying aspects of the preprocessing (e.g., skull stripping), choice of evaluation scheme (e.g., 10-fold CV), and differences between spatial and volumetric accuracy of the results. Although we cannot share our in-house data, all implementations will be made available on GitHub and architecture descriptions will be added to the suppl. material.
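
The ipsilateral cropping discussed above can be illustrated as follows. This is only a sketch under simplifying assumptions: the helper `crop_hemisphere` is hypothetical, the midline is assumed to coincide with the central image column (which in practice would require prior alignment), and `side` refers to image columns rather than to any radiological convention. The preprocessing actually used in the paper is not specified here.

```python
def crop_hemisphere(image, side):
    """Keep the left or right half of a 2D slice given as a list of rows.

    Assumes the midline lies on the central column of the image.
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    mid = len(image[0]) // 2
    return [row[:mid] if side == "left" else row[mid:] for row in image]

slice_2d = [[1, 2, 3, 4],
            [5, 6, 7, 8]]
left_half = crop_hemisphere(slice_2d, "left")
right_half = crop_hemisphere(slice_2d, "right")
```

Because the affected side is supplied as an input, any prediction on the contralateral side is excluded by construction, which is the source of the reviewers' concern about reduced false positives.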




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The work addresses a new and interesting application of stroke lesion outcome prediction from 4D CT perfusion, which has not been investigated before. The rebuttal made further clarifications on the concerns reviewers have raised, though some details about the method are still missing. The argument about the novelty, though not very strong, can be acceptable considering the new application it tries to tackle. Thus it is recommended for acceptance due to its potential interest to the community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    6



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This manuscript presented a deep learning approach to predict stroke lesion volumes from 4D perfusion CT scans. The dataset and application are quite novel and interesting to the MICCAI community, although several reviewers raised concerns about the novelty of the technical approach (R2,R4) and some missing details on the method implementation (R2).

    However, overall the merits of the manuscript, which include a reasonable approach and appropriate evaluation, outweigh the weaknesses.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The proposed Transformer-based framework for spatio-temporal analysis of Computed Tomography Perfusion images was found relevant and sound. The rebuttal was positive in addressing the questions of the reviewers, mostly about motivation, comparison to related methods, and the experimental setting. While some limitations should still be clarified in the final version of the manuscript, overall the paper was found interesting enough for acceptance to the conference.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4


