
Authors

Jiawen Yao, Xianghua Ye, Yingda Xia, Jian Zhou, Yu Shi, Ke Yan, Fang Wang, Lili Lin, Haogang Yu, Xian-Sheng Hua, Le Lu, Dakai Jin, Ling Zhang

Abstract

Esophageal cancer is the second most deadly cancer. Early detection of resectable/curable esophageal cancers has great potential to reduce mortality, but no guideline-recommended screening test is available. Although some screening methods have been developed, they are expensive, might be difficult to apply to the general population, and often fail to achieve satisfactory sensitivity for identifying early-stage cancers. In this work, we investigate the feasibility of esophageal tumor detection and classification (cancer or benign) on the noncontrast CT scan, which could potentially be used for opportunistic cancer screening. Global context features of the esophagus have been proven in clinical practice as key signs for cancer detection, especially early-stage ones. To capture such global context, a novel position-sensitive self-attention is proposed to augment nnUNet with non-local interactions. Our model achieves a sensitivity of 93.0% and specificity of 97.5% for the detection of esophageal tumors on a holdout testing set with 180 patients. In comparison, the mean sensitivity and specificity of four doctors are 75.0% and 83.8%, respectively. For the classification task, our model outperforms the doctors' mean performance by absolute margins of 17%, 31%, and 14% for cancer, benign tumor, and normal, respectively. Compared with established state-of-the-art esophageal cancer screening methods, e.g., blood testing and endoscopy AI system, our method has comparable performance and is even more sensitive for early-stage cancer and benign tumor. Our proposed method is a novel, non-invasive, low-cost, and highly accurate tool for opportunistic screening of esophageal cancer.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_33

SharedIt: https://rdcu.be/cVRtj

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors propose a novel, non-invasive, low-cost, and highly accurate tool for opportunistic screening of esophageal cancer based on noncontrast CT scans, covering both esophageal tumor detection and classification (cancer or benign). The model achieves a sensitivity of 93.0% and a specificity of 97.5% for the detection of esophageal tumors on a holdout testing set of 180 patients; for classification, it outperforms the doctors' mean performance by absolute margins of 17%, 31%, and 14% for cancer, benign tumor, and normal, respectively. Compared with established state-of-the-art esophageal cancer screening methods, e.g., blood testing and endoscopy AI systems, it is even more sensitive for early-stage cancer and benign tumors.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. They present a deep learning method to detect and classify esophageal tumors from noncontrast CT, a novel, non-invasive, low-cost, ready-to-distribute, and highly accurate tool, for screening esophageal cancer.
    2. They proposed the position-sensitive full-attention layer to better use the positional information and long-range dependencies in 3D noncontrast CT, which could improve the performance over a strong baseline nnUNet model.
    3. Compared with doctors’ reading of noncontrast CT, their automated method shows substantially higher accuracy in both detection and classification.
    4. Compared with established state-of-the-art esophageal cancer screening methods, e.g., blood testing [11] and the endoscopy AI system [14], their screening tool has comparable performance and is even more sensitive for early-stage cancer and benign tumor.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors do not clearly describe the main methods, including how the segmentation model is trained and what loss function is used; the operational details of the position-sensitive full-attention layer, which is the main contribution of this study (e.g., what the position o = (i, j, k) and the possible positions p refer to); and the classification steps. This affects the readability and reproducibility of the proposed model.
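    For readers puzzled by the same notation, below is a minimal sketch of what a position-sensitive self-attention layer of this kind might look like, assuming the Axial-DeepLab-style formulation (Wang et al., ECCV 2020): o indexes the query position, p ranges over key positions along one axis, and r_{p-o} are learned relative positional embeddings. All names and details are illustrative, not the authors' actual implementation.

```python
# Minimal 1D position-sensitive self-attention sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionSensitiveAttention1D(nn.Module):
    def __init__(self, dim, span):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        # One embedding per relative offset p - o in [-(span-1), span-1],
        # separately for the query, key, and value terms.
        self.rel_q = nn.Parameter(torch.randn(2 * span - 1, dim) * 0.02)
        self.rel_k = nn.Parameter(torch.randn(2 * span - 1, dim) * 0.02)
        self.rel_v = nn.Parameter(torch.randn(2 * span - 1, dim) * 0.02)
        self.span, self.scale = span, dim ** -0.5

    def forward(self, x):                        # x: (batch, length, dim)
        b, n, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Relative offsets p - o, shifted so they index the embedding tables.
        idx = torch.arange(n)[None, :] - torch.arange(n)[:, None]
        idx = (idx + self.span - 1).clamp(0, 2 * self.span - 2)   # (n, n)
        rq, rk, rv = self.rel_q[idx], self.rel_k[idx], self.rel_v[idx]
        # logits[o, p] = q_o.k_p + q_o.r^q_{p-o} + k_p.r^k_{p-o}
        logits = torch.einsum('bod,bpd->bop', q, k)
        logits = logits + torch.einsum('bod,opd->bop', q, rq)
        logits = logits + torch.einsum('bpd,opd->bop', k, rk)
        attn = F.softmax(logits * self.scale, dim=-1)
        # y_o = sum_p attn[o, p] * (v_p + r^v_{p-o})
        out = torch.einsum('bop,bpd->bod', attn, v)
        return out + torch.einsum('bop,opd->bod', attn, rv)

# e.g., y = PositionSensitiveAttention1D(dim=32, span=64)(torch.randn(2, 64, 32))
```

    In a 3D CT volume, such a layer would presumably be applied along each axis of the feature map in turn (axial attention) to keep memory tractable, but that design choice is an assumption here, which is exactly why the paper should spell these details out.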

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    It is interesting and a good idea to improve tumor segmentation performance by adding the proposed position-sensitive self-attention layer to each encoding layer of the nnUNet segmentation network. However, the detailed method and training procedure are not introduced clearly, which affects the reproducibility of this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    1) A clear description of the operation of the global self-attention layer should be given for better readability and reproducibility. 2) The classification method should be described in detail. 3) How the segmentation network is trained should be described clearly.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose a good idea to improve tumor segmentation performance: a position-sensitive self-attention layer added to each encoding layer of the nnUNet segmentation network. This is a genuine novelty. However, the method is not described clearly, which affects reproducibility.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a deep learning method to classify esophageal tumors from non-contrast CT. The deep learning method is based on a baseline nnUNet model (ref 8 in the paper) but incorporates position-sensitive full-attention layers. The authors claim improved performance of their method compared with doctors’ reading of non-contrast CT (which is not the gold-standard screening method for esophageal cancer), and a performance comparable to established state-of-the-art esophageal cancer screening methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The large dataset from 4 institutions is an important strength.
    • The manuscript is well-written and well-organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The statistics are lacking in this paper: a) whenever a value for a performance metric is reported (such as AUC, sensitivity, or specificity), it should be accompanied by an error estimate (preferably a 95% confidence interval); b) no statistical tests were performed to support claims of superior performance. Just because one number appears to be higher than another does not mean the improvement is statistically significant. (See the sketch after this list for one way to compute such an interval.)
    • In spite of the large dataset (and a relatively large hold-out test set, presumably completely independent from training/validation and only used once as a test set), there are only 80 normals within the test set. The authors note that the prevalences differ from screening, the intended application (it is normal practice to do this to increase statistical power), but they do not focus much on the performance for these normal cases. Of course one wants to detect all cancers, but the cost of false-positives is important, especially when extrapolating to a screening setting. Without statistical proof or an error estimate, the baseline nnUNet (Figure 3) seems to have no false-positives for normal scans, while the proposed method finds one more cancer at the cost of two false-positives. Is this benefit worth the cost (assuming statistical proof can be made)?
    • The ‘reader study’ is not really a reader study but a comparison (without supporting statistics) of the deep learning model’s performance with that of physicians. What would be of more interest is how the physicians perform with and without AI aid.
    • The model appears to be an incremental change from the baseline model (ref 8).
    • Because of the class imbalance of the test dataset (80:20:80), the authors should consider class-weighted versions of ‘accuracy’ as a performance metric, since plain ‘accuracy’ is influenced by class prevalence (balanced accuracy is sketched below).
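    Both remedies asked for above are straightforward to implement. Here is a minimal, hedged sketch with made-up labels and predictions (not the paper's data): a percentile-bootstrap 95% confidence interval for sensitivity, and scikit-learn's balanced accuracy, which averages per-class recall and is therefore not distorted by the 80:20:80 test-set mix.

```python
# Bootstrap 95% CI for sensitivity + balanced accuracy (illustrative data).
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bootstrap_sensitivity_ci(y_true, y_pred, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(np.asarray(y_true) == 1)   # positive cases only
    y_pred = np.asarray(y_pred)
    stats = [(y_pred[rng.choice(pos, pos.size, replace=True)] == 1).mean()
             for _ in range(n_boot)]                # resample positives
    return np.percentile(stats, [2.5, 97.5])

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])   # made-up labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1, 0, 0])   # made-up predictions
lo, hi = bootstrap_sensitivity_ci(y_true, y_pred)
print(f"sensitivity 95% CI: [{lo:.2f}, {hi:.2f}]")
# Balanced accuracy = mean per-class recall; works for 3 classes as well.
print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
```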
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Neither the code nor the data appears to be publicly available. The reproducibility statement is unclear to me in that, under #4, the authors filled in N/A on a few occasions, whereas I think that for any AI method the exploration of hyper-parameters and sensitivity is applicable.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Minor comments:

    • It is unclear what the last 3 columns in Table 1 are; are these sensitivities? Table 1 and Figure 3 seem to convey the same information. It would be good to place Table 1 directly above Figure 3 in the manuscript, rather than interspersing it with Figure 2, which shows example output.
    • ROC curves should be plotted on square axes.
    • Figure 3 (ROC curve) needs error bars for the operating points (and preferably a shaded 95% confidence interval for entire curve)
    • It is unclear how the AI operating point was determined. This should have been done based on the training/validation results without taking the test set into account at all, i.e., a threshold for the output score should have been determined on the training/validation set and then applied to the test set (see the sketch after this list). Please clarify.
    • The authors talk about detection and classification, but the performance evaluation is framed entirely as a classification problem (ROC, sensitivity, specificity) without localization or FROC (free-response ROC). Localization is implicitly included through a cutoff for the Dice score (with the reference standard), but subsequent performance evaluation assigns a single label (cancer, benign, normal) to a scan. Looking at the example output of the method (Figure 2), it would be possible, e.g., to have more than one false-positive region in a scan marked as false-positive. In true detection problems, FROC analysis is important to evaluate performance, especially the number of, and types of, false-positive marks for normal scans.
    • The overlap of 0.1 for the Dice score that determines a true-positive ‘hit’ seems to be very low; why so low?
    • The sub-analysis by cancer stage is interesting but lacks statistical power (and needs error estimates). The dataset needs to be better described in terms of stage and lesion sizes in the ‘Datasets’ section.
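    To make the recommended protocol concrete, here is a hedged sketch (illustrative numbers and names, not the authors' pipeline) of fixing the operating threshold on the validation set before ever touching the test set, combined with the Dice >= 0.1 hit criterion discussed above.

```python
# Threshold picked on validation only, then the Dice-based 'hit' rule.
import numpy as np

def dice(pred_mask, ref_mask):
    inter = np.logical_and(pred_mask, ref_mask).sum()
    denom = pred_mask.sum() + ref_mask.sum()
    return 2.0 * inter / denom if denom > 0 else 0.0

def pick_threshold(val_scores, val_labels, min_sensitivity=0.90):
    # Most specific threshold that still meets the sensitivity target,
    # computed on validation cases only.
    pos = np.sort(val_scores[val_labels == 1])[::-1]
    return pos[int(np.ceil(min_sensitivity * pos.size)) - 1]

val_scores = np.array([0.95, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10])  # per-scan scores
val_labels = np.array([1, 1, 1, 1, 0, 0, 0])
thr = pick_threshold(val_scores, val_labels)   # fixed before seeing test data

# One illustrative test scan: it counts as a true positive only if its
# score clears `thr` AND its predicted mask hits the lesion with Dice >= 0.1.
pred = np.zeros((4, 4), bool); pred[1:3, 1:3] = True
ref = np.zeros((4, 4), bool); ref[2:4, 2:4] = True
print(f"thr={thr:.2f}, dice={dice(pred, ref):.2f}, hit={dice(pred, ref) >= 0.1}")
```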
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This seems to be a pretty strong paper, but the lack of statistical analysis is a major weakness.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The work provides a self-attention-based nnUNet model for screening esophageal cancer, a non-invasive, low-cost, ready-to-distribute, and highly accurate tool that shows strong performance compared with doctors and other AI tools.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strengths of the paper: (1) A novel, effective opportunistic esophageal cancer screening model; the position-sensitive full-attention layer improves nnUNet performance. (2) Very detailed experiments and discussion: A. the authors implemented both two-class and three-class classification experiments (see Table 1); B. detailed comparisons with other algorithms and with readers (see Fig. 3, Table 2, and Table 3). (3) Strong evaluation (see Tables 1-3 and Fig. 3). (4) The large dataset and strong annotations ensure the study’s quality.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses: (1) The overall novelty and innovation of the work are limited: the self-attention module is not a new technique, and nnUNet is a classic segmentation method. (2) The experimental demonstration and discussion are sufficient and complete, but the research methods and ideas are relatively simple.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    According to the Reproducibility Response, the reproducibility of the paper is good. Providing the dataset and code would further improve repeatability.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    I think this is a good study, with relatively complete and sufficient experiments and discussion. Further refinement of the language, along with supplementary charts, would help this study be published in a good journal.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experiments were very complete and meticulous.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors introduce a pipeline to investigate esophageal tumor detection and classification using non-contrast CT scans. The work introduces a position-sensitive self-attention module with non-local interactions to augment the baseline nnUNet. On a holdout testing cohort of 180 patients, the pipeline demonstrates high sensitivity/specificity for the detection stage, and the three-class classification stage shows high performance. Additional superiority over other AI-based esophageal cancer screening systems is documented. The paper is well-written and organized, has a novel technical contribution, and provides a non-invasive and low-cost approach for an important application. The work demonstrates the potential to be used for opportunistic cancer screening. I align with all the reviewers regarding the work’s contribution and the relatively complete analysis with sufficient experiments and discussions. A few points should be addressed to enhance the work: (1) provide statistical analysis (with p-values) for the reported methods (one possible paired test is sketched below); (2) comment on how the data imbalance across the three classes was tackled; (3) add details on hyperparameter settings and optimization for reproducibility. Were the training epochs really 1000, or is this a typo? Also, please address R2’s constructive feedback regarding the Dice threshold and the ROC curves.
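    For point (1), one standard way to attach p-values to paired comparisons on the same test cases (proposed model vs. baseline nnUNet, or model vs. readers) is an exact McNemar test on the discordant pairs. The sketch below uses illustrative counts, not the paper's results.

```python
# Exact McNemar test for two paired classifiers (illustrative counts).
from scipy.stats import binomtest

b = 12   # cases model A got right and model B got wrong
c = 4    # cases model B got right and model A got wrong
# Under H0 (equal error rates) each discordant case is a fair coin flip,
# so the exact McNemar p-value is a two-sided binomial test on b of b+c.
print(f"exact McNemar p = {binomtest(b, b + c, 0.5).pvalue:.4f}")
```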

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

N/A


