Authors

Yihua Sun, Hee Guan Khor, Sijuan Huang, Qi Chen, Shaobin Wang, Xin Yang, Hongen Liao

Abstract

Esophageal cancer is a significant global health concern, and radiotherapy (RT) is a common treatment option. Accurate delineation of the gross tumor volume (GTV) is essential for optimal treatment outcomes. In clinical practice, patients may undergo a second round of RT to achieve complete tumor control when the first course of treatment fails to eradicate cancer completely. However, manual delineation is labor-intensive, and automatic segmentation of esophageal GTV is difficult due to the ambiguous boundary of the tumor. Detailed tumor information naturally exists in the previous stage; however, the correlation between the first and second course RT is rarely explored. In this study, we first reveal the domain gap between the first and second course RT, and aim to improve the accuracy of GTV delineation in the second course RT by incorporating prior information from the first course. We propose a novel prior Anatomy and RT information enhanced Second-course Esophageal GTV segmentation network (ARTSEG). A region-preserving attention module (RAM) is designed to understand the long-range prior knowledge of the esophageal structure, while preserving the regional patterns. Sparsely labeled medical images for various isolated tasks necessitate efficient utilization of knowledge from relevant datasets and tasks. To achieve this, we train our network in an information-querying manner. ARTSEG incorporates various prior knowledge, including: 1) Tumor volume variation between first and second RT courses, 2) Cancer cell proliferation, and 3) Reliance of GTV on esophageal anatomy. Extensive quantitative and qualitative experiments validate our designs.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_48

SharedIt: https://rdcu.be/dnwL3

Link to the code repository

N/A

Link to the dataset(s)

https://competitions.codalab.org/competitions/21145


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to tackle a less-studied second-course RT GTV delineation problem in esophageal cancer patients. The proposed stratified framework adopts the self-attention module, i.e., the proposed region-preserving attention module (RAM), to leverage the interactions between the first- and second-course GTV segmentation. Trained using both ‘real and synthetic pairs in-house cohort’ and ‘unpaired public SegTHOR dataset,’ the proposed method demonstrates an improved performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of the paper are summarized as follows: 1) The authors have delved into a lesser-explored subject: the auto-segmentation of second-course GTV. The effort put into preparing data for both the first and second courses is highly commendable. I hope the authors can publish the dataset if the paper gets accepted. The topic itself contributes substantially to the paper’s novelty.

    2) The proposed framework and RAM attempt to reduce treatment- and registration-induced bias when using 1st- and 2nd-course GTV pairs for training. While there may be debates regarding the general design and rationale discussions, this framework could still be effective even in cases where 1st- and 2nd-course GTV pairs are not available.

    The overall idea is easy to follow. Points 1) and 2) jointly drive the paper’s novelty.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The major concerns are listed as follows: 1) The motivation for using RAM, i.e., attention, to focus on the elongated information is insufficiently discussed. The elongated information would be well addressed because the FOV is 16x and 32x of the original patch size (where the RAM is applied). E.g., if the input patch’s z-direction is 16 slices (48mm), then the FOVs (where RAM is applied) are 768mm and 1536mm, which could be good enough to cover the esophagus. The authors might want to showcase the corresponding “heat-map/attention map/CAM” to further validate/demonstrate the impact of RAM-imposed elongated information.

    2) Given the paired images and the associated 1st-course GTV, the network will prioritize the GTV region and not the elongated parts. I would assume there should be other factors for the performance gain (e.g., more focused region? registration bias reduction? context std reduction?), and the authors might want to dig further.

    3) Esophageal cancer patients who have undergone RT treatment are usually at Stage II or III. The average GTV segmentation is around 73%. Given a total of 179 patients with less than 70% Dice 1st-course GTV segmentation performance, I assume the model could be under-trained. Based on my experience, when concatenating CT with the 1st-course GTV image (possibly using a more constrained/cropped region), the 2nd-course GTV shouldn’t be hard to detect. Again, a 3D UNet with 57.5% Dice and 9.14mm ASD makes the comparison questionable or insufficient. In the future, I would suggest the authors use nn-UNet as a baseline and report the results.

    While I wouldn’t say the overall framework is ineffective, the authors might want to bring insights regarding the proposed framework and the performance gains.

    Some minor suggestions: 1) Please discuss the detailed differences between MHA and RAM in the Method section. Please report the training patch sizes, patient population, and maybe the GTV stats (e.g., volume) of the 1st- and 2nd-course data.

    2) Please polish the paper, e.g., Section 4.1 – * and *, math term usage in equations, paper logic, etc.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The overall framework could be reproduced, yet lacks some training information, e.g., patch size.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please refer to 6

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall topic and the general framework design drive the paper’s novelty. Based on my major concerns in section 6, I am giving a “weak reject” for now. Yet, I am open to reading the rebuttal and will make my final decision based on that.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    • leverage prior information from the first course to improve GTV segmentation performance in the second course
    • the training strategy is not specific to any task but challenges the network to retrieve information from another encoder with augmented inputs
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • one of few methods for automated second course GTV segmentation
    • a new encoder training strategy
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Though the paper addresses an understudied clinical application and proposes a training strategy, the technical contributions are relatively limited. The region-preserving attention module is not new.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The paper is reproducible based on the parameter settings provided in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It is suggested to further demonstrate the generalization ability of the model using different segmentation backbones.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a clinical application that has never been studied by other computerized/deep learning methods before. The experimental results, especially the ablation study results, demonstrate the effectiveness of the methods.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a novel deep learning-based approach, called ARTSEG, for accurately delineating gross tumor volume (GTV) in second-course esophageal cancer radiotherapy (RT) by incorporating prior information from the first course of treatment. The proposed approach includes a region-preserving attention module (RAM) that captures long-range prior knowledge of the esophageal structure while preserving regional patterns. The network is trained in an information-querying manner to efficiently utilize knowledge from relevant datasets and tasks. The authors validate their approach through extensive experiments, both quantitative and qualitative, demonstrating that ARTSEG achieves superior segmentation results compared to several state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Addresses an important clinical need for accurate GTV delineation in second-course esophageal cancer RT.
    2. Incorporates prior knowledge from the first course of treatment, which can improve segmentation accuracy.
    3. Proposes a novel region-preserving attention module to capture long-range prior knowledge while preserving regional patterns.
    4. Trains the network in an information-querying manner, which efficiently utilizes knowledge from relevant datasets and tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The dataset used in the study is limited, and the generalizability of the approach needs to be validated on larger and more diverse datasets.
    2. The study does not provide a comparison of the computational time and resources required for the proposed approach compared to other state-of-the-art methods.
    3. The proposed approach heavily relies on the quality and accuracy of the prior knowledge from the first course of treatment, which may not always be available or reliable.
    4. The study does not address potential biases in the dataset, such as variability in imaging equipment or imaging protocols, which could impact the generalizability of the approach.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have presented a clear and detailed description of their proposed model architecture. However, the primary dataset used in this study, which includes first- and second-course esophageal cancer RT plan CTs, is not commonly used for segmentation purposes. Furthermore, the authors have not disclosed the primary dataset or training codes, which could limit the reproducibility of their study. It would be beneficial for the authors to provide additional information regarding their dataset and training process to enable others to reproduce their results accurately.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. To improve the clarity of the paper, the authors need to briefly introduce the three datasets (Sp, Sv, Se) before discussing the training strategy, or at the beginning of that section. While Sp is introduced in section 3.1 and Sv in section 3.2, the use of Se in section 3.2 is not introduced at that point. Se is introduced in section 3.3. By introducing all three datasets before discussing the training strategy, readers can better understand the data flow and follow the authors’ description of their approach.

    2. The two distinct randomized augmentations are designed to generate two different regions that can mimic an unregistered image set. However, since both regions are selected from the same dataset, there is a possibility that they may have a large common area, which could affect the accuracy of the model in identifying the differences between the two courses. As for why they are applied to the paired dataset, it is to ensure that the network is capable of handling variations in the dataset and can generalize well to unseen data. However, it would be reasonable for the authors to apply the same augmentation for both datasets to maintain consistency and reduce the risk of introducing biases.
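    The concern about a large common area can be made concrete with a small sketch. It assumes, hypothetically, that the two randomized augmentations include independent random crops of the same volume; the reviews do not confirm the actual augmentation pipeline, and the function names and the 96-voxel crop size below are illustrative only.

```python
import random

def random_crop_start(volume_shape, crop_shape, rng):
    """Pick a random start index per axis so the crop fits inside the volume."""
    return tuple(rng.randint(0, v - c) for v, c in zip(volume_shape, crop_shape))

def overlap_fraction(start_a, start_b, crop_shape):
    """Fraction of one crop's voxels that are also covered by the other crop."""
    overlap_voxels = 1
    crop_voxels = 1
    for a, b, c in zip(start_a, start_b, crop_shape):
        # Shared extent along this axis (zero if the crops are disjoint here).
        shared = max(0, min(a + c, b + c) - max(a, b))
        overlap_voxels *= shared
        crop_voxels *= c
    return overlap_voxels / crop_voxels

# Hypothetical sizes: a 128^3 volume (as stated in the rebuttal) and a 96^3 crop (assumed).
rng = random.Random(0)
volume_shape = (128, 128, 128)
crop_shape = (96, 96, 96)
start_a = random_crop_start(volume_shape, crop_shape, rng)
start_b = random_crop_start(volume_shape, crop_shape, rng)
fraction = overlap_fraction(start_a, start_b, crop_shape)
```

    Under these assumed sizes the start offsets differ by at most 32 voxels per axis, so at least 64 of 96 voxels per axis are always shared. Smaller crops would allow fully disjoint regions.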

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are many strengths, including the innovative approach of incorporating prior knowledge to improve GTV segmentation in the second course of esophageal cancer RT and the use of a novel region-preserving attention module. However, the study has some limitations, such as the limited dataset used in the study, the lack of clinical validation, and the potential need for additional computational resources compared to traditional manual segmentation methods.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents a method for delineating GTV contours on second-course esophageal cancer patients using information from first-course RT images. This is an important clinical problem and is not well-studied in literature. As R1 mentioned, the effort to prepare the paired first- and second-course data is highly commendable. The training strategy using different public/private datasets/tasks is also novel. The superior segmentation performance is well demonstrated using extensive experiments against state-of-the-art methods with ablation studies. As R3 pointed out, though, limitations of this work can be clarified in the final submission, e.g. more computational/memory resources (?) perhaps versus the compared approaches, reliance on the high quality and accurate first-course GTV delineations, potential biases in data acquisition (imaging equipment or protocol) which might impact the generalization of the proposed approach. Since the code/datasets might not be released, R1 also requested more details on training patch sizes, general info about the patient population, and GTV stats (e.g. volume) for first- and second-course data. Finally, R1’s major concerns regarding the motivation for RAM in elongated information, the factors behind the performance gain, and why the simple straw-man solution of concatenating first- and second-course GTV contours wouldn’t work can be further clarified.




Author Feedback

We sincerely thank the reviewers for their valuable suggestions and comments, and we will revise our paper accordingly, e.g. introducing the datasets before the training strategy (R3) and discussing the differences between MHA and RAM in the Method section (R1).

Q(R1): The motivation for using RAM. Can elongated information be addressed because of the large FOV in the deep features? A: The entire input volume has a spatial shape of 128x128x128, with a voxel size of 1.2x1.2x3 mm^3. The input is then fed into the convolutional encoder, where the deepest feature shape is 256x8x8x8 (256 is the channel dimension). Each feature (256x1x1x1) in the deepest layer has a 32x FOV of the input voxel size. So the FOV in the z-axis for each feature (256x1x1x1) is 32x3=96mm, which is insufficient to cover the entire esophagus region. Besides, given that the FOV of the total input volume in the z-axis is only 128x3=384mm, the FOV in the feature space cannot be enlarged to 1536mm. The attention mechanism enables the network to establish long-range correlation in the deep feature space, which helps to address the elongated information.
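
The arithmetic in this answer can be reproduced with a short sketch. All numbers are taken from the rebuttal text; the variable and function names are illustrative, and the token count assumes (hypothetically) that attention treats each spatial position of the deepest feature map as one token, which the rebuttal does not state.

```python
# Values quoted in the rebuttal; names are illustrative assumptions.
INPUT_SHAPE = (128, 128, 128)   # input volume in voxels
Z_SPACING_MM = 3.0              # voxel size along the z-axis
DEEPEST_SPATIAL = (8, 8, 8)     # spatial shape of the deepest feature map
FOV_FACTOR = 32                 # per-feature FOV factor stated in the rebuttal

def extent_mm(n_voxels, spacing_mm):
    """Physical extent covered by n_voxels at the given spacing."""
    return n_voxels * spacing_mm

# z-axis field of view of a single deepest-layer feature: 32 x 3 mm.
z_fov_per_feature_mm = extent_mm(FOV_FACTOR, Z_SPACING_MM)

# z-axis extent of the whole input volume: 128 x 3 mm.
z_extent_input_mm = extent_mm(INPUT_SHAPE[2], Z_SPACING_MM)

# Assumed tokenization: one token per spatial position of the deepest map.
n_tokens = DEEPEST_SPATIAL[0] * DEEPEST_SPATIAL[1] * DEEPEST_SPATIAL[2]

print(z_fov_per_feature_mm, z_extent_input_mm, n_tokens)
```

The two extents match the 96mm and 384mm figures quoted above, so no single convolutional feature spans the full z-range of the input.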

Q(R1): Why can understanding the elongated esophagus help the network prioritize the 1st-course GTV region? A: Since the location of the esophageal tumor (GTV) varies along and around the esophagus, it is beneficial to incorporate prior knowledge of this elongated structure. The esophagus segmentation dataset S_e challenges the network to retrieve information within the whole esophagus, which facilitates understanding of the structure and emphasizes a more focused region around the esophageal neighborhood. Besides, RAM helps the network to query information within the elongated structure for a more comprehensive understanding. We are willing to investigate more factors behind the performance gain of the network.

Q(R1): The Dice score is relatively low. A: The comparison on the test set illustrates the performance gap between the 1st- and 2nd-course RT for GTV segmentation, and demonstrates that incorporating prior knowledge from the 1st course can improve the segmentation result for the 2nd. Although S_v contains 179 patients, the 1st-course Dice is still less than 0.7, probably because S_v and S_p come from multiple centers: there may exist gaps in patient populations, annotation variability between experts at different centers, and different exams (esophagoscope or MRI) used to confirm the tumor location for contouring GTV in CT at different centers. We will take the reviewer’s suggestion to use a more constrained region as input to further improve the performance, and compare it against nnUNet in our future study.
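
For reference, the Dice score discussed here follows the standard definition 2|P∩T| / (|P| + |T|) for a predicted mask P and a reference mask T. The sketch below is a minimal pure-Python version on flattened binary masks, not the authors' evaluation code; returning 1.0 when both masks are empty is an assumed convention.

```python
def dice_score(pred, target):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Toy masks: 3 foreground voxels each, 2 of them shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```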

Q(R1, R3): Network details & input details. A: The total input volume shape is 128x128x128. The model is trained on a single RTX 3090 GPU. The inference time for ARTSEG+RAM is 12.60ms per case.

Q(R1, R3): Dataset statistics. A: The GTV volume (cm^3) mean/std. for dataset S_v is 40.6/29.75. In S_p, the GTV volume is 83.70/55.97 for 1st-course and 71.66/49.36 for 2nd-course. The patient population will be included in the final version.
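
As a rough reading of these statistics, the relative change in mean GTV volume between the two courses in S_p can be computed as below. This compares cohort means only and says nothing about per-patient change; the function name is illustrative.

```python
# Mean GTV volumes (cm^3) for S_p as quoted in the rebuttal.
FIRST_COURSE_MEAN_CM3 = 83.70
SECOND_COURSE_MEAN_CM3 = 71.66

def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before

change = relative_change(FIRST_COURSE_MEAN_CM3, SECOND_COURSE_MEAN_CM3)
print(f"Relative change in mean GTV volume: {change:.1%}")
```

The mean second-course volume is roughly 14% smaller than the mean first-course volume.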

Q(R3): Potential biases in variability in imaging equipment or imaging protocols. A: Our multi-center datasets (S_p, S_v) can improve the robustness of our method to varying imaging protocols. We will improve our future study with more diverse datasets and provide more detailed evaluations of the network’s robustness to different protocols.

Q(R3): ARTSEG heavily relies on the accuracy of GTV from the 1st-course of treatment. A: In the real radiotherapy scenario, the GTV area will be carefully reviewed by a senior expert for confirmation before the execution of treatment. So in the context of improving the 2nd-course GTV segmentation accuracy, the prior GTV annotation in the 1st-course can be viewed as reliable.

We will take the reviewers’ suggestions to clarify the limitation and improve our future study, including different segmentation backbones (R2) and generalizability (R3).


