
Authors

Edward G. A. Henderson, Andrew F. Green, Marcel van Herk, Eliana M. Vasquez Osorio

Abstract

Automatic segmentation of organs-at-risk (OARs) in CT scans using convolutional neural networks (CNNs) is being introduced into the radiotherapy workflow. However, these segmentations still require manual editing and approval by clinicians prior to clinical use, which can be time-consuming. The aim of this work was to develop a tool to automatically identify errors in 3D OAR segmentations without a ground truth. Our tool uses a novel architecture combining a CNN and a graph neural network (GNN) to leverage the segmentation’s appearance and shape. The proposed model was trained in a data-efficient manner using a synthetically generated dataset of parotid gland segmentations with realistic contouring errors. The effectiveness of our model was assessed with ablation tests, evaluating the efficacy of different portions of the architecture as well as the use of transfer learning from a custom pretext task. Our best performing model predicted errors on the parotid gland with a precision of 85.0% and 89.7% for internal and external errors respectively, and a recall of 66.5% and 68.6%. This offline QA tool could be used in the clinical pathway, potentially decreasing the time clinicians spend correcting contours by detecting regions which require their attention. All our code is publicly available at https://github.com/rrr-uom-projects/contour_auto_QATool
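As a rough illustration only, the following is a minimal PyTorch sketch of how a hybrid CNN-GNN node classifier of the kind the abstract describes might be assembled. It is not the authors' implementation (which is available at the repository above): the layer sizes, the mean-aggregation graph layer, and the three message-passing rounds are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Small 3D CNN encoding an image patch centred on each mesh node (sizes hypothetical)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )

    def forward(self, patches):          # (N, 1, D, H, W) -> (N, feat_dim)
        return self.net(patches)

class MeanGraphLayer(nn.Module):
    """Plain mean-aggregation message passing over the contour mesh."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):           # x: (N, dim), adj: (N, N) row-normalised
        neigh = adj @ x                  # average of neighbour features
        return torch.relu(self.lin(torch.cat([x, neigh], dim=-1)))

class ErrorClassifier(nn.Module):
    """CNN features per node -> GNN over the mesh -> per-node 5-class error prediction."""
    def __init__(self, feat_dim=64, n_classes=5):
        super().__init__()
        self.encoder = PatchEncoder(feat_dim)
        self.gnn = nn.ModuleList([MeanGraphLayer(feat_dim) for _ in range(3)])
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, patches, adj):
        x = self.encoder(patches)
        for layer in self.gnn:
            x = layer(x, adj)
        return self.head(x)              # (N, 5) logits, one row per mesh node
```

A dense row-normalised adjacency matrix keeps the sketch dependency-free; a real implementation would more likely use a sparse graph library such as PyTorch Geometric.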

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_31

SharedIt: https://rdcu.be/cVRyL

Link to the code repository

https://github.com/rrr-uom-projects/contour_auto_QATool

Link to the dataset(s)

https://github.com/deepmind/tcia-ct-scan-dataset


Reviews

Review #1

  • Please describe the contribution of the paper

    X

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    X

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    X

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    X

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    X

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper attempts to train a solution for identifying segmentation errors. Segmented parotid glands were perturbed with a tiny amount of random noise and then smoothed. The network architecture is interesting, containing both a CNN and a graph NN. Graph elements are classified into five groups according to their distance from the original segmentation. Overall, this is a worthwhile paper in that it targets an important problem and the approach is plausible. The evaluation is fine, but without a proper user study we cannot really know whether this will be useful.
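As a rough illustration of the perturbation scheme this review describes (random noise on the segmentation surface followed by smoothing), here is a minimal NumPy sketch. The noise scale, smoothing weights, and iteration count are illustrative assumptions, not the paper's settings, and a closed triangle mesh is assumed.

```python
import numpy as np

def perturb_mesh(verts, faces, noise_mm=2.0, smooth_iters=10, rng=None):
    """Perturb mesh vertices with random noise, then Laplacian-smooth the result.

    verts: (N, 3) vertex coordinates in mm; faces: (M, 3) vertex indices.
    noise_mm and smooth_iters are illustrative values, not the paper's settings.
    """
    if rng is None:
        rng = np.random.default_rng()
    noisy = verts + rng.normal(scale=noise_mm, size=verts.shape)

    # Build vertex adjacency from the triangle faces.
    neighbours = [set() for _ in range(len(verts))]
    for a, b, c in faces:
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))

    # Simple Laplacian smoothing: move each vertex towards its neighbours' mean.
    for _ in range(smooth_iters):
        means = np.stack([noisy[list(nb)].mean(axis=0) for nb in neighbours])
        noisy = 0.5 * noisy + 0.5 * means
    return noisy
```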

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

The authors present a combined CNN-GNN model to perform segmentation error prediction. The model consists of three parts. First, a CNN encoder generates features from image patches extracted along the boundary of the contour; the intention is to include appearance information in the task. Second, a GNN operating on the meshed contour, with the features of the CNN encoder attached to the nodes, updates each node’s representation according to its local neighborhood. Third, an MLP classifies each node’s features into five classes representing different bins of signed (inside/outside) distances to the surface of the true contour. The model is trained and evaluated with synthetically perturbed contours of parotid glands.
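To make the first stage of the pipeline described above concrete, here is a minimal NumPy sketch of extracting an image patch at each node of the meshed contour; the patch size and edge-padding strategy are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def extract_node_patches(ct_volume, verts_vox, size=9):
    """Extract a cubic CT patch centred on each mesh vertex (vertex coords in voxels).

    size is an illustrative patch edge length; the paper's value may differ.
    """
    half = size // 2
    padded = np.pad(ct_volume, half, mode="edge")  # avoid out-of-bounds at image borders
    patches = np.empty((len(verts_vox), size, size, size), dtype=ct_volume.dtype)
    for i, (z, y, x) in enumerate(np.round(verts_vox).astype(int)):
        # After padding by `half`, padded[z:z+size] is centred on the original voxel z.
        patches[i] = padded[z:z + size, y:y + size, x:x + size]
    return patches  # one appearance patch per graph node, fed to the CNN encoder
```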

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The major strength of the work is the novelty of the hybrid CNN-GNN for contouring error prediction. The method requires neither the secondary training of alternative models, as related ensemble approaches do, nor statistical model information for local error prediction; and unlike classification approaches that only flag failed contours, the node-wise classification offers richer information about potentially erroneous parts of the segmentation. Also, the training time of only 10 minutes for a new organ is very interesting for a potential practical application as an additional QA tool for both manual and automated contours.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Although the work is methodically very interesting, a few major weaknesses can be identified:

Terminologically, the authors claim their method to be self-supervised (training with synthetic data) and unsupervised (pre-training of the feature extraction); however, neither is the case. I would categorize the creation of the perturbed contours as a smart form of data augmentation, and the pre-training as supervised, since the class of each patch (on the contour or not) is directly extracted from the reference contour. The usage of both terms, self-supervision and unsupervised pre-training, is confusing and does not follow their definitions. The authors use the public dataset of Nikolov et al. as the basis for their work. The dataset contains reference contours free of human bias (peer-reviewed by expert radiologists and oncologists), as well as contours by a radiologist which are used in the clinical study of Nikolov et al. as a human reference. The authors simply state that they use one of the contours. Although the dataset is only used as a proof of concept for the work, the authors should make clear which contours they use and should not train on human-biased contours. The method is trained and validated on the same synthetic dataset; as the approach is methodically novel, there is no reference method available. However, it would add a lot of value to validate the method on real erroneous contours, i.e., the second contours of the used dataset created by an independent human annotator, or the output of a model.
The selection of the hyperparameters for the creation of the synthetic dataset, as well as the process of meshing the contour, seems arbitrary. It could make sense to perturb the contours within the range of a meaningful observer variability or the expected error range of a current automated segmentation approach, both values that are studied in the work of Nikolov et al. Although the authors state in the discussion that a future step is to generate training data based on real observer variations, a meaningful parametrization of the perturbations could already be integrated. In the same sense, the resolution of the mesh could be determined in a rule-based manner from the size of the organ; while the chosen values seem to work well for the tested organ, it is not clear how the mesh resolution would suit much larger or smaller organs. Finally, the selection of five bins/classes of distances is also missing a justification.
The ablation study shows that pre-training the feature-extracting CNN does not lead to an increase in performance; dedicating less than a whole section to that process could be considered.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The dataset is publicly available, and the authors state that the code will be made available as well. However, the architecture of the networks does not become clear from Figs. 2 and 3 alone. What do the values in the CNN, GNN, and fully connected blocks in Figs. 2 and 3 indicate? Also, the dimensions of the features generated by the CNN, and later by the fully connected layer for the classification, are relevant information that is missing.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

Beyond the identified major weaknesses, only a few minor problems were found:

Page 2, end of the Introduction: an ablation study is probably not a contribution in itself; see also the self-supervision terminology issue above.

Page 5, Unsupervised pre-training: pre-training vs. transfer learning terminology; self-supervised vs. unsupervised pretext task terminology.

Page 7, Ablation tests: what is an erratic validation? Loss curves could probably be included, in combination with the training loss. If the validation loss is really smoothed by the pre-training, seeing the curves would also help to understand the statement in the discussion claiming that the training is smoothed.

Page 8, Discussion: the sentence “However, these approaches require the adoption of the new segmentation models themselves.” is unclear. Also, in “… fills a void as most …”, what is meant by “fills a void”? The statement that the parotid gland is a difficult organ to segment needs a reference.

General: most of the paper is written in the past tense, which, if used at all, should only be used when discussing related work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The novelty and practical relevance of the introduced method outweigh the weaknesses of the work. The selection of the hyperparameters can be adapted to meaningful contouring errors, and the evaluation can be extended to at least the human annotator contours in the used dataset (better, to the results of a current automated approach), in order to additionally obtain an evaluation on practical, not only synthetic, contours. The partly incorrect terminology must be corrected, but overall, the method's innovative combination of CNNs and GNNs, considering both appearance and topology very efficiently for contouring error prediction, justifies an accept.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

To develop a tool that automatically identifies errors in 3D OAR segmentations without a ground truth, this paper proposes a novel quality-assurance (QA) architecture combining a convolutional neural network (CNN) and a graph neural network (GNN) to leverage the segmentation’s appearance and shape. Experiments demonstrate the effectiveness of the proposed QA method. The paper structure is good, and the figures are nice.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The topic is relatively fresh. Many papers discuss how to delineate targets, while only a limited number consider error control.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The idea is clearly presented. However, it seems that only five levels of error are analyzed, from larger than 2.5 mm to smaller than -2.5 mm. Is there any reason for using the two thresholds of 1 mm and 2.5 mm? (A minimal sketch of such binning follows this list.)
    2. The dataset description is sparse. At the least, we should know the resolution of the original CT scans, since errors of less than 1 mm are discussed. If the pixel spacing of the CT is larger than 1 mm (especially in the axial direction), the results become meaningless.
    3. The paper is more clinically driven than focused on the technique itself. As mentioned, in radiotherapy there is normally a margin between the PTV and CTV, sometimes 5 mm. If so, checking for errors of 1-2.5 mm is not that important. So, please discuss the motivation.
    4. From Fig. 4, subfigures (a) and (d) make it easy to see that large errors are easier to recognize. However, from (b), it seems the decrease in performance on small errors (confusion matrix values between -1 mm and 1 mm) is not as obvious as that on large errors. How do you comment on that?
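
As referenced in point 1 above, here is a minimal NumPy sketch of binning per-node signed distances into the five classes this review describes, using the 1 mm and 2.5 mm thresholds mentioned there. The sign convention (positive = outside the reference contour) is an assumption.

```python
import numpy as np

def bin_signed_distances(signed_dist_mm):
    """Map per-node signed distances (mm) to five error classes.

    Bins: < -2.5 | [-2.5, -1) | [-1, 1) | [1, 2.5) | >= 2.5
    (thresholds taken from the review; sign convention assumed).
    """
    edges = np.array([-2.5, -1.0, 1.0, 2.5])
    return np.digitize(signed_dist_mm, edges)  # class indices 0..4
```

For example, `bin_signed_distances(np.array([-3.0, 0.2, 1.7]))` returns `array([0, 2, 3])`.
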
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Given the provided implementation details, I guess it is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

See the main weaknesses section above.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is relatively new.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This work presents an approach to predict segmentation errors using a CNN appearance encoder and a graph neural network encoding shape information of likely segmentations. The reviewers agree that the idea is novel, and the overall context and motivation of the work are highly relevant. The reviewers mention a number of weaknesses that can be addressed in a minor revision. More extensive experimentation would be worthwhile, but is out of scope for this paper. Overall, the favorable reviewer assessments outweigh the criticisms, leading to a provisional acceptance of the paper. We ask the authors to take into account the issues raised by the reviewers in case of final acceptance.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

N/A


