
Authors

Yanbo Shao, Minghao Wang, Juanyun Mai, Xinliang Fu, Mei Li, Jiayin Zheng, Zhaoqi Diao, Airu Yin, Yulong Chen, Jianyu Xiao, Jian You, Yang Yang, Xiangcheng Qiu, Jinsheng Tao, Bo Wang, Hua Ji

Abstract

Lung cancer is one of the most lethal cancers worldwide. Computed tomography (CT) makes it possible to diagnose lung cancer at an early stage, which can significantly reduce its mortality. In recent years, deep neural networks (DNNs) have been widely used to improve the accuracy of benign/malignant pulmonary nodule classification. A limitation of the DNN approach, however, is that a model’s performance and generalization depend heavily on the size and quality of the training data. To the best of our knowledge, almost all existing public lung nodule datasets, e.g., LIDC-IDRI, obtain the crucial benign/malignant labels by radiographic analysis instead of pathological examination. In this paper, we argue that, without pathology reports and hence without authentic labels, machine-learning (ML) models based on LIDC-IDRI fall short on generalization. To prove our hypothesis, we introduce a new lung CT image dataset with pathological information (LIDP) for lung cancer screening. LIDP contains 990 samples, including 783 malignant and 207 benign samples. More critically, the labels of all samples have been confirmed by pathological biopsy. We evaluate various existing LIDC-based state-of-the-art (SOTA) models on LIDP. Our experimental results show the extremely poor generalization of existing SOTA models trained on the LIDC-IDRI dataset. Our conclusion is striking: the distributions of these datasets are significantly different. We claim that the LIDP dataset is a very valuable addition to existing datasets such as LIDC-IDRI. LIDP is well suited for independent testing or for training new ML models for early lung cancer detection.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16437-8_74

SharedIt: https://rdcu.be/cVRuZ

Link to the code repository

https://github.com/MHW-NKU/LIDP-model-evaluation

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a database of CT images suitable for the development and evaluation of pulmonary nodule classification algorithms, which is of clinical relevance, especially in lung cancer screening.

    The main contribution of the paper is that, per the authors’ claim and to my understanding too, this is the first lung cancer screening dataset that has a pathological gold standard for its nodules. Previous databases use radiological interpretation of the nodule type by experienced radiologists; however, such labels carry an inherent, likely interpretation error. The well-accepted gold standard for nodule classification is biopsy and pathology assessment. All the patients underwent surgery after their CTs, and the nodules were analysed pathologically.

    The database is relatively large (990 CT scans) and comes from a plurality of institutions (8). The authors further show the poor performance on the new database of state-of-the-art algorithms trained on other databases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of the paper are:

    • A new database of pulmonary nodules with pathology-based gold standard
    • Proof of lack of generality of algorithms developed on the LIDC-IDRI database
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Clarity of presentation. Even after two reads, the number of nodules in the database is unclear to me. What is the precise meaning of ‘samples’? Do the authors refer to CT scans, nodules, or pathology reports? Please clarify. Please confirm that all visible nodules on the CT scans have a pathology-based reference standard. The section on the visualization of the datasets is not very informative, since there is a lot of processing involved. What are the extracted features? Are the features from the nodules or from the whole CT? A more interesting analysis would be a comparison of the image characteristics of LIDP and LIDC-IDRI with respect to reconstruction kernels, slice thicknesses, etc.; that information alone could explain the differences in the visualization of Figure 1.
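
    A minimal sketch of the acquisition-metadata comparison suggested above, assuming both datasets are available as local DICOM directories (pydicom usage is standard; the directory paths are hypothetical):

```python
from collections import Counter
from pathlib import Path

import pydicom

def acquisition_stats(dicom_root):
    """Tally reconstruction kernels and slice thicknesses across a dataset."""
    kernels, thicknesses = Counter(), Counter()
    for path in Path(dicom_root).rglob("*.dcm"):
        # Read only the header; pixel data is not needed for this survey.
        ds = pydicom.dcmread(path, stop_before_pixels=True)
        kernels[str(getattr(ds, "ConvolutionKernel", "unknown"))] += 1
        thicknesses[float(getattr(ds, "SliceThickness", 0) or 0)] += 1
    return kernels, thicknesses

# Hypothetical paths; compare the two distributions side by side.
for name, root in [("LIDC-IDRI", "data/lidc"), ("LIDP", "data/lidp")]:
    kernels, thicknesses = acquisition_stats(root)
    print(name, kernels.most_common(5), thicknesses.most_common(5))
```

    Systematic differences between the two tallies would support the point above: the visualization gap in Figure 1 might reflect acquisition parameters rather than nodule characteristics.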

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    This paper will only be reproducible if the database is made publicly available. The authors do not mention that on the paper. I strongly recommend doing so.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Define LIDP. What does it stand for? It is very unclear what the authors refer to as ‘samples’. Please clarify.

    It is sobering to see negative results published. Having another dataset for nodule classification is of great use to the community.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I rate the paper as “Strong accept” assuming that the database is going to be made publicly available. If that is not the case, I would rate it as weak reject.

    As stated above, the main contribution of the paper is the very strong dataset that the authors have generated by using the gold standard pathology information. This could be a landmark database for pulmonary nodule classification AI models.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a new chest CT dataset consisting of labeled nodules (location and segmentation) with pathology-based ground truth and demonstrates its importance in comparison to limitations identified in the LIDC-IDRI dataset. Models trained on both datasets are used to demonstrate these findings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The key strength of the paper is that it provides an open set of images on which future research can build to develop and evaluate new approaches (detection, classification, etc.) for lung cancer detection. It addresses weaknesses of LIDC-IDRI, used across broad research topics, by having pathologically confirmed ground truth. This data, if easy to share and access, would be of great benefit to the research community in general.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper attempts to cover a wide area of topics and ends up missing important details that make it difficult to assess the quality of the data curated. It would be better to eliminate extraneous material in the manuscript and focus on details around the dataset and its preparation to better emphasize why this dataset is of sufficient quality or to highlight its limitations for future research. Proper adjudication appears to be missing, making the dataset less appealing for research without re-labeling.

    Most striking is the following:

    ADJUDICATION - section 3.1 - The contour “was checked twice” seems like a very unscientific way to establish a contour.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is proposing to share a new dataset, making this portion extremely reproducible. The machine learning approaches, in contrast, appear to be extremely difficult or impossible to reproduce.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The following are limitations of the manuscript that should either be detailed further or eliminated to allow more room to describe the dataset.

    Claims made should be cited: DNN models have “become one of the main computer assisted techniques for early screening of lung cancer” - please provide a reference for this. DNNs currently lead research approaches, but in terms of actual products, has there been a study demonstrating that DNN products are dominantly used in lung cancer screening?

    Page 2 - “Data with a score of 3” - what does this mean? This should be detailed further.

    • “the classic DNN model” - page 3 top - this is simply too general as there are many different DNN networks. There is no “classic” model. This should be removed from the paper unless more detail could be provided.

    • Reference [20] - Wu, G.X., Raz, D.J.: Lung cancer screening. Lung Cancer pp. 1–23 (2016) - appears to be incomplete.

    Section 3.1 - length of a nodule is introduced here and never mentioned elsewhere - do the authors mean diameter?

    ADJUDICATION - section 3.1 - The contour “was checked twice” seems like a very unscientific way to establish a contour

    The choice of patient age >18 is strange, since most approaches target either age >50 for screening or age >35 for incidental findings.

    “Malignancy of 3” is not clear - does this mean the malignancy was detected 3 years later?

    Section 3.4 Visualization of the dataset does not add much to the paper - consider removing this section

    Consider removing or abbreviating Section 4 - there are simply not enough details to really warrant the multiple experiments. It leaves the reader with multiple questions that are not sufficiently covered in the manuscript and simply weakens the manuscript as a whole.

    How were operating points selected?

    How was the data split? What was the split?

    How was the training and check point selection performed?

    What architecture was used?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a much-needed pathology-based cancer ground truth to the research community. This is a great step in enhancing training, understanding, and evaluation in future papers. Focusing on the dataset preparation itself would greatly strengthen the paper. As it is, it attempts to cover too many topics with too little detail.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a new dataset for lung cancer screening with pathological information, called LIDP. It is emphasized that the LIDC-IDRI lung nodule dataset has been over-studied and has crucial generalization problems.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • A new dataset (according to the authors, the largest available dataset with a pathological gold standard) is presented to be used as a benchmark for early lung cancer detection; it has pathological information instead of radiological analysis.
    • The presence of hard-to-classify samples.
    • The main disadvantages of the LIDC-IDRI dataset compared to LIDP are explained, and reasons are given why only one other dataset is used for comparison.
    • The advantages of using this dataset, which can serve as a supplement to LIDC-IDRI, are explained in detail.
    • It properly explores the statistical and demographic distribution.
    • Overall, good arguments are made for the presented dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experimental part should be extended. For example, Table 2 confirms that models trained on LIDC-IDRI do not generalize well to other datasets like LIDP, but it is not possible to confirm that LIDP generalizes correctly. I also agree that LIDP is very complementary to LIDC-IDRI, so it would be interesting to improve the evaluation by using both datasets to train models. The dataset is very unbalanced, containing many more malignant than benign cases, which can lead to an increase in false positives (malignant) when training the models; this is evidenced in Table 3 by high recall values but low specificity values, as illustrated in the sketch below. No information is given about the size (in gigabytes) or the resolution of the dataset.
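
    To make the recall/specificity point concrete, here is an illustrative calculation (the label counts mirror LIDP’s 783 malignant / 207 benign split; the degenerate predictor is hypothetical, not a model from the paper):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels mirroring LIDP's class balance: 783 malignant (1), 207 benign (0).
y_true = np.array([1] * 783 + [0] * 207)
# A degenerate classifier that calls every nodule malignant.
y_pred = np.ones_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"recall      = {tp / (tp + fn):.2f}")           # 1.00 -- looks excellent
print(f"specificity = {tn / (tn + fp):.2f}")           # 0.00 -- imbalance hides this
print(f"accuracy    = {(tp + tn) / len(y_true):.2f}")  # 0.79 -- also misleadingly high
```

    On a 783/207 split, even this trivial predictor achieves 79% accuracy and perfect recall, which is why specificity must be reported alongside recall on LIDP.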

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Even knowing that the main purpose of this paper is to present a new dataset, it should have included a description of the computer infrastructure used (hardware and software), more information about memory requirements, and other specifications of the training and testing process.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    • The introduction should be revised, as well as all references (the paper starts with the 18th reference).
    • Be consistent when creating acronyms, e.g., whether they start with a capital letter (Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI)) or not (Low-dose computed tomography(LDCT)).
    • Add a reference for Project Bell.
    • Explain the reason for the second sentence in point 3.1 (“When collecting…”); it is important for less experienced readers.
    • Why do the legends in the diagrams (LC015) differ from the actual name of the dataset (LIDP)?
    • It would be interesting to train with the LIDP dataset and test with LIDC-IDRI.
    • Please increase the quality of the discussion in point 4.3.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents good arguments for the presented dataset (LIDP) and identifies important weaknesses in the LIDC-IDRI dataset. Further testing and more details on these tests are needed to assess the quality of the new data, but it seems to have great potential and covers the gaps in the existing benchmark dataset. However, I am not entirely confident that the introduction of new datasets is appropriate for MICCAI.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The main contribution of the paper is really the database, and the authors are expected to make the dataset publicly available for reproducibility of the methods.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    1




Author Feedback

General Response

Availability of LIDP. We plan to release LIDP in two stages. Within one year of the paper’s publication, researchers interested in LIDP can contact the corresponding author for any non-commercial purpose, committing to abide by a non-profit agreement for scientific research only. One year after the paper is published, we will fully open LIDP on our lab’s website or GitHub.

Data quality of LIDP. The slice thickness of the CT scans in LIDP is less than or equal to 2 mm, and the slice resolution is 512×512 except for three CT scans. LIDP includes 1165 annotated nodules from 990 CT scans. The final manuscript will include the slice resolution, slice thickness, and nodule count of LIDP.

The effectiveness of the annotation contours. We agree that hand-labeled contours have a certain gap from the ground truth, which is an unavoidable problem in all lung CT datasets. The contours of nodules in LIDP were annotated by experienced oncologists, who also carried out two rounds of review of the contours’ accuracy.

The visualization analysis section. This section was condensed due to the page limit. We utilized nodule images and the corresponding nodule features, e.g., solidity, diameter, lobulation, spiculation, and calcification, to visualize the distribution of nodules across the datasets. As suggested, we will remove this section in the final manuscript because the other parts are adequate to demonstrate the differences between LIDP and LIDC-IDRI.

The details of the experiments in Section 4. The experimental details were condensed due to the page limit. In each experiment, we carried out 5-fold cross-validation for training and held out independent test sets for testing. The training and testing methods will be supplemented in the final manuscript. In the final manuscript, the inadequacies of the language will be revised according to your recommendations.

Response to Reviewer 1

The meaning of LIDP. LIDP is the abbreviation of “Lung Image Dataset with Pathological Information”.

The meaning of “sample”. A sample in this paper stands for a CT scan and its corresponding pathology report.

The number of nodules and pathology reports. All 1165 nodules larger than 3 mm in diameter were labeled in LIDP. The pathology report of each CT scan covers only one nodule’s pathological information; the nodule with a pathology report is called the target nodule.

Response to Reviewer 2

The meaning of “data with a score of 3” and “malignancy of 3”. In LIDC-IDRI, radiologists manually assigned each lung nodule a malignancy score ranging from 1 to 5, with higher scores indicating greater malignancy. A rating of 3 indicates that radiologists cannot classify the nodule as benign or malignant from the CT scan alone, i.e., it is a difficult-to-classify sample. For more information, please refer to the introduction of LIDC-IDRI.

The meaning of “the length of a nodule” in Section 3.1. The length of a nodule means its long diameter.

The choice of patient age >18. In Project Bell, the age requirement for this dataset was 18 years or older. Adults are the most likely to develop lung cancer.

Response to Reviewer 3

Why the generalization of LIDP was not confirmed using LIDC-IDRI. Due to the lack of pathological information, the labels of LIDC-IDRI are inaccurate; therefore we did not evaluate the generalization of LIDP using LIDC-IDRI.
Why the legends in the diagrams (LC015) differ from the actual name of the dataset (LIDP). LC015 was the original name of the dataset, which was later changed to LIDP. In the final manuscript, we will correct the legend errors.

Why the paper starts with the 18th reference. In our paper, references are sorted by the first letter of the first author’s name.

Introduction to Project Bell: http://ncrc.gyfyy.com/index.php?ac=article&at=read&did=509
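
For readers unfamiliar with the LIDC-IDRI malignancy scale described above, the following sketch shows how the score-3 (“difficult-to-classify”) annotations can be retrieved with the pylidc library (this assumes the LIDC-IDRI DICOM data has been downloaded and pylidc configured to point at it; the tooling is illustrative and not part of the paper):

```python
import pylidc as pl

# Annotations whose radiologist-assigned malignancy score is exactly 3,
# i.e., indeterminate from the CT appearance alone.
indeterminate = pl.query(pl.Annotation).filter(pl.Annotation.malignancy == 3)
print(f"{indeterminate.count()} indeterminate annotations")

for ann in indeterminate.limit(5):
    print(ann.scan.patient_id, f"estimated diameter = {ann.diameter:.1f} mm")
```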


