
Authors

Yiwen Zhang, Chengguang Hu, Liming Zhong, Yangda Song, Jiarun Sun, Meng Li, Lin Dai, Yuanping Zhou, Wei Yang

Abstract

Early screening is an important way to reduce the mortality of hepatocellular carcinoma (HCC) and improve its prognosis. As a noninvasive, economic, and safe procedure, B-mode ultrasound is currently the most common imaging modality for diagnosing and monitoring HCC. However, because of the difficulty of extracting effective image features and modeling longitudinal data, few studies have focused on early prediction of HCC based on longitudinal ultrasound images. In this paper, to address the above challenges, we propose a spatiotemporal attention network (STA-HCC) that adopts a convolutional-neural-network–transformer framework. The convolutional neural network includes a feature-extraction backbone and a proposed regions-of-interest attention block, which learns to localize regions of interest automatically and extract effective features for HCC prediction. The transformer can capture long-range dependencies and nonlinear dynamics from ultrasound images through a multihead self-attention mechanism. Also, an age-based position embedding is proposed in the transformer to embed a more-appropriate positional relationship among the longitudinal ultrasound images. Experiments conducted on our dataset of 6170 samples collected from 619 cirrhotic subjects show that STA-HCC achieves impressive performance, with an area under the receiver-operating-characteristic curve of 77.5%, an accuracy of 70.5%, a sensitivity of 69.9%, and a specificity of 70.5%. The results show that our method achieves state-of-the-art performance compared with other popular sequence models.
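As a reading aid, the four reported metrics can be computed from binary labels and predicted scores as in the minimal sketch below. The function name `binary_metrics`, the 0.5 threshold, and the toy inputs in the usage note are illustrative assumptions; they are not taken from the paper, whose operating point is not stated here.

```python
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, accuracy, and AUC for binary labels.

    threshold is an assumed operating point, not one reported by the authors.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    y_pred = y_score >= threshold

    tp = np.sum(y_pred & y_true)      # positives correctly flagged
    tn = np.sum(~y_pred & ~y_true)    # negatives correctly cleared
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)

    # AUC as the probability that a random positive outscores a random
    # negative, counting ties as one half (Mann-Whitney formulation).
    pos = y_score[y_true][:, None]
    neg = y_score[~y_true][None, :]
    auc = np.mean((pos > neg) + 0.5 * (pos == neg))

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / y_true.size,
        "auc": auc,
    }
```

For example, `binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` (made-up values) evaluates four samples at the assumed threshold. Sensitivity and specificity depend on that threshold, whereas AUC does not.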

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_51

SharedIt: https://rdcu.be/cVRuC

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This paper focuses on early prediction of hepatocellular carcinoma (HCC) based on longitudinal ultrasound images. The authors propose a spatiotemporal attention network that adopts a convolutional-neural-network-transformer framework. Their method achieves state-of-the-art performance compared with other popular sequence models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper focuses on early prediction of HCC based on longitudinal ultrasound images. The authors claim that few studies have focused on this topic, and I also could not find similar studies.
    • The authors built a large dataset of longitudinal US examinations for HCC, including 619 subjects (although the dataset is not public).
    • The proposed method achieves better performances compared with popular sequence deep learning models such as LSTM, BiLSTM, GRU and vanilla Transformer.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The proposed method lacks novelty in some technical aspects. The proposed method is almost a combination of existing technologies. The core idea of the ROI attention block is the same as the method originally proposed in [1]. In addition, the transformer encoder used in the proposed method has the same architecture as the original transformer [2]. The age-based position embedding in the proposed method has some originality (and experimental results show that it can improve performance), but it is not novel from a technical point of view.

    [1] Wang et al., “Non-local neural networks”, CVPR 2018
    [2] Vaswani et al., “Attention Is All You Need”, NIPS 2017

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • The experimental conditions in terms of reproducibility are well described.
    • The authors provide source codes for training and evaluation.
    • The dataset is not public.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    [Major comments]

    • The main drawback of the proposed method is its limited technical novelty. The authors mention the difference between the proposed method and nonlocal attention in section 2.3, but the difference might be marginal. Please provide a clearer and more detailed explanation of the novelty of the proposed method.
    • Some experimental conditions are not clear. (1) Why do the authors set N to 3? What does it mean in terms of clinical benefit? (2) Were the ultrasound images captured using the same ultrasound diagnosis machine? If not, is there any impact on the performance? (3) The ablation studies include “(iii) STA_HCC without the age-based PE (w/o age-based PE)”. How do the authors define PE in this case? I think the authors do not use the PE used in the vanilla transformer, since the experimental results shown in Fig. 3 and Table 1 are different.

    [Minor comments]

    • The term “longitudinal” is confusing. In medical ultrasound imaging, “longitudinal” is generally used to indicate the direction of the image plane. It might be good to provide a note defining the term.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the technical novelty of the proposed method is marginal, the method has some impact on early prediction of HCC based on longitudinal ultrasound images, which has clinical benefit and is a new application in MIC.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents a CNN-transformer network for hepatocellular carcinoma (HCC) diagnosis from longitudinal ultrasound. The authors incorporate an ROI attention block into the CNN to perform feature extraction. They also use an age-based positional encoding to process longitudinal imaging of arbitrary time intervals.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The ROI attention block and age-based encoding are simple and sensibly designed. Good baselines and ablations. Other researchers working on classification tasks from longitudinal imaging data can draw inspiration from this work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No glaring weaknesses, although comparison to radiologist performance and analysis of failure cases would help demonstrate clinical utility. I am also curious how much radiologists actually rely on previous scans to perform a diagnosis, and how much this model does. Is it sufficient to just show the latest + current scan to get good accuracy? What about the current scan alone?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Private dataset and code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Usually I understand ROI to mean bounding box. Consider renaming “ROI attention” to “spatial attention”
    • My understanding is the “Transformer [18]” baseline does not use age-based PE? It seems that your model w/o age-based PE does worse than [18], which suggests that maybe [18] could outperform your model if you just add age-based PE.
    • Would be a little stronger with multisite data.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a simple yet incremental approach that could be useful for a wide range of longitudinal image classification tasks (although the paper only demonstrates it on the task of HCC diagnosis from ultrasound). The experiments are clear and fairly convincing.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper presents a new early HCC prediction method using a spatiotemporal attention network (STA-HCC) based on longitudinal US images. The topic is especially important in the case of 1) defining regions of interest (ROIs) in longitudinal US images and 2) sequential images and irregular temporal components. The proposal uses non-longitudinal US images to predict HCC. The motivation of the paper is clearly defined, and a brief overview of the state of the art is presented. Paper contributions: 1) ROI attention block, 2) age-based position embedding in transformers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The task investigated in the paper is clinically important.
    2. It is interesting to try to include age-information features in the position embedding.
    3. The figures are illustrative.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experimental evaluation should be clearer.
    2. The writing quality of this paper is not satisfactory.
    3. There are sentences which require some references.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Paper is clear enough that an expert could confidently reproduce

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. There are major weaknesses in the evaluation of the method: The information about hepatocellular carcinoma is somewhat localized in an MRI slice. Compared with reference [22], why do you think taking advantage of the gating mechanism would help localize focal areas? Could you add more experiments to show the necessity of ROI attention in this task? For example, adding results without the sigmoid or without ROI attention to Fig. 4 would intuitively show the improvement in localization.

    2. There are sentences which require some references.

      • On page 2, “…. (US) is currently the most common imaging modality for diagnosing and monitoring HCC” requires a reference.
      • On page 8, “bidirectional LSTM (BiLSTM)” requires a reference.
      • On page 6, “We chose SE-ResNet50….” requires a reference.

    3. Typos:

      • The dataset collected from… → The dataset was collected from…
      • ..track and focus the same lesion… → ..track and focus on the same lesion…
      • ..used to establish a appropriate positional… → ..used to establish an appropriate positional…
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper applies transformers to a clinical task, which looks interesting. However, the ROI attention block proposed in the paper has not been effectively verified. To be precise, the experiments cannot prove the superiority of the ROI attention block over nonlocal attention [20].

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a CNN-transformer network for hepatocellular carcinoma (HCC) diagnosis from longitudinal ultrasound images. The proposed method uses an age-based positional encoding to process longitudinal imaging with arbitrary time intervals.

    The reviewers think this paper is clinically important for HCC early detection from ultrasound. The work is validated on a large private dataset with 619 subjects and compared with other popular deep learning models for sequence data analysis, including LSTM, BiLSTM, GRU and vanilla transformer.

    The reviewers also identified several weaknesses, such as

    • The proposed method lacks novelty in some technical aspects. The techniques used appear to come from prior work. The reviewers also acknowledge that introducing and adapting such methods to this specific clinical task is valuable.
    • Some experimental evaluation and conditions are not very clear, which could be improved.
    • An explanation of the current radiologist workflow would help strengthen the motivation of the method. As R3 asked, “Is it sufficient to just show the latest + current scan to get good accuracy? What about the current scan alone?” That is a valid question, which the authors may wish to explore.
  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3




Author Feedback

We thank the ACs and reviewers for their recognition of our work. We address their concerns as follows:

Q1: The proposed method lacks novelty in some technical aspects. (AC, R2) A: Although our proposed ROI attention block looks similar to the non-local block [20], the two differ in several ways. The non-local block multiplies the VALUE feature map by the attention map to obtain a residual, which is then added to the input feature map. In our ROI attention block, the attention map is multiplied element-wise with the input feature map directly, which acts more like a gating mechanism. This design imposes stronger constraints, forcing the ROI attention block to focus on valuable regions. In addition, all channels of the input feature map share a single spatial attention map, which facilitates visualization of attention and improves interpretability. Normalized pooling is also a simple and ingenious design that ensures a consistent distribution of feature values.
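The contrast drawn in this answer can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the weight arguments stand in for learned 1x1 convolutions, the sigmoid is assumed from Reviewer #4's comment, the non-local block is simplified (no output projection), and normalized pooling is omitted because its details are not given here. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_spatial_attention(x, w):
    """Gating-style attention: one (H, W) map shared by every channel.

    x: (C, H, W) feature map; w: (C,) hypothetical 1x1-conv weights.
    """
    scores = np.tensordot(w, x, axes=([0], [0]))  # (H, W) score per location
    attn = sigmoid(scores)                        # values in (0, 1)
    return x * attn[None, :, :], attn             # scales the input directly

def nonlocal_residual(x, w_theta, w_phi, w_g):
    """Simplified non-local block: attention-weighted values added to x.

    w_theta, w_phi: (C', C) and w_g: (C, C) hypothetical projections.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                    # (C, N) with N = H * W
    logits = (w_theta @ flat).T @ (w_phi @ flat)  # (N, N) pairwise affinities
    logits = logits - logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True) # softmax over key positions
    residual = (w_g @ flat) @ attn.T              # (C, N) aggregated values
    return x + residual.reshape(c, h, w)          # residual connection
```

In the gated form, a location whose attention value is near zero is suppressed in every channel, and the single `attn` map can be visualized directly. In the non-local form the input always passes through unchanged via the residual connection, which is the weaker constraint the answer refers to.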

Q2: Why do the authors set N to 3? (R2) A: On the one hand, using 3 years as the prediction time, it is possible to screen and diagnose liver cancer at a relatively early stage of development (such as high-grade atypical hyperplastic nodules and very early hepatocellular carcinoma nodules) and receive early ablation or surgical treatment in order to obtain a better prognosis. On the other hand, the mean follow-up among the patients included in this study was 5.8 years (with a minimum follow-up of no less than 3 years), and if a model was chosen to predict the risk of developing liver cancer within 5 years (or even longer), the number of images that could eventually be included in the analysis would be significantly reduced and would ultimately affect the predictive performance of the model.

Q3: “STA-HCC w/o age-based PE” in Fig. 3 and “Transformer [18]” in Table 1. (R2, R3) A: “STA-HCC w/o age-based PE” in Fig. 3 does not use any position embedding, while “Transformer [18]” in Table 1 uses the conventional sinusoidal function with the position order of the tokens as input.
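To make the distinction concrete, the sketch below implements the conventional sinusoidal embedding and shows the two kinds of input it could receive. It assumes, purely for illustration, that an age-based embedding feeds the patient's age at each examination into the same function in place of the token index; the paper's exact formulation may differ. The function name and the example ages are hypothetical.

```python
import numpy as np

def sinusoidal_embedding(positions, d_model, base=10000.0):
    """Conventional sinusoidal position embedding (d_model assumed even).

    positions: 1-D sequence of scalars, one per token. These may be integer
    token indices or, under the assumption stated above, real-valued ages.
    """
    pos = np.asarray(positions, dtype=float)[:, None]   # (T, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model / 2)
    angles = pos / base ** (2 * i / d_model)            # (T, d_model / 2)
    pe = np.zeros((pos.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

# Order-based input: exams are treated as equally spaced tokens 0, 1, 2.
pe_order = sinusoidal_embedding([0, 1, 2], d_model=8)

# Age-based input (hypothetical ages in months): irregular gaps between
# exams are reflected in the embedding instead of being discarded.
pe_age = sinusoidal_embedding([624.0, 630.0, 648.0], d_model=8)
```

With order-based input, exams taken 6 months apart and 18 months apart receive the same positional offset; with a real-valued time input, the embedding varies with the actual interval.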

Q4: Is it sufficient to just show the latest + current scan to get good accuracy? What about the current scan alone? (AC, R3) A: This is an interesting question that is worthy of deeper study. Our work was not designed with this in mind at the outset. Investigating it may require redesigning the sampling rules and the data split. We will study it in our future work.

Q5: The ROI attention block proposed in the paper has not been effectively verified. (R4) A: In the ablation experiment, we demonstrated the improvement of the ROI attention block over the baseline. However, constrained by page limits and time, we did not compare it with other advanced attention mechanisms. In future work, we will comprehensively compare it with other attention mechanisms, including non-local attention, to demonstrate the advantages of our method.


