Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang

Abstract

Multi-organ segmentation in abdominal Computed Tomography (CT) images is of great importance for diagnosis of abdominal lesions and subsequent treatment planning. Though deep learning based methods have attained high performance, they rely heavily on large-scale pixel-level annotations that are time-consuming and labor-intensive to obtain. Due to its low dependency on annotation, weakly supervised segmentation has attracted great attention. However, there is still a large performance gap between current weakly-supervised methods and fully supervised learning, leaving room for exploration. In this work, we propose a novel 3D framework with two consistency constraints for scribble-supervised multiple abdominal organ segmentation from CT. Specifically, we employ a Triple-branch multi-Dilated network (TDNet) with one encoder and three decoders using different dilation rates to capture features from different receptive fields that are complementary to each other to generate high-quality soft pseudo labels. For more stable unsupervised learning, we use voxel-wise uncertainty to rectify the soft pseudo labels and then supervise the outputs of each decoder. To further regularize the network, class relationship information is exploited by encouraging the generated class affinity matrices to be consistent across different decoders under multi-view projection. Experiments on the public WORD dataset show that our method outperforms five existing scribble-supervised methods.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_4

SharedIt: https://rdcu.be/dnwLe

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper describes the development of a weakly supervised approach to multiple organ image segmentation using initializations based on scribbles. The approach is novel in that it performs the segmentations in 3D using a triple-branch network with one encoder and three decoders, each using a different dilation rate. In addition, voxel-wise uncertainty is used to help rectify soft-labels and supervise the 3 decoder outputs. Promising results, outperforming other existing methods, are shown using the WORD public dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    There is innovation/novelty in the architecture design based on the 3-decoder channels with different dilation rates, as well as the multi-view 3D full segmentation ideas. The use of K-L divergence to compare affinity matrices is interesting, and the loss functions make sense overall. The implementation description is straightforward and easy to understand. Finally, both the quantitative and qualitative results look promising on this standard WORD dataset, indicating that those working in this area will find the work of interest.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While performance is promising, the improvements are generally incremental, although more promising for some organs (e.g. liver and pancreas in Table 1). Testing on more than this one dataset would also be helpful. While the ablation analysis is helpful, more interpretation about the utility of both the different dilation rates (is it mainly to reduce overfitting?) and the uncertainty-weighted pseudo labels would also be welcome.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducibility description is reasonable and the results on a publicly available dataset make the work accessible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    As noted above, any further interpretation would be useful, e.g., about the utility of both the different dilation rates (is it mainly to reduce overfitting?) and the uncertainty-weighted pseudo labels.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Both the quantitative and qualitative results are promising and useful for those working in this area. Some novelty is evident in the architecture and the multiple dilations. In general, a reasonable paper with incremental and some substantial improvements to the state of the art in weakly supervised, multi organ segmentation.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents a weakly supervised method for scribble-based abdominal organ segmentation. The method connects three decoders with different dilation rates to the encoder and takes their averaged outputs as the pseudo labels. Then an uncertainty-weighted pseudo-label consistency loss and a class-similarity consistency loss are utilized to train with unannotated pixels. Experimental evaluations are conducted on a public CT dataset, showing improvements over previous methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The presented method obtains better performance than comparison approaches.
    • By using scribble annotations, segmentation performance of several abdominal organs can reach over 90% in Dice.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The technical innovation is incremental compared to the existing scribble-based segmentation method DMPLS. The designs of multiple decoders and pseudo-label ensembling are similar to DMPLS, and they contribute the major performance improvement, as shown in the ablation study in Table 2.

    • The motivation of using decoders with multi-scale dilations for scribble-based segmentation is unclear.

    • Important information on the scribble annotations is lacking. How are the scribble annotations obtained? Are both foreground and background pixels annotated? Are there any requirements on the scribbles?

    • What is the loss L_{MAC} in Eq. 5?

    • Why do the results of best-performing model in Table 2 not match the results in Table 1?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Code will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The authors need to justify their technical contributions compared to previous works.
    • It needs to be clearly explained why using multi-scale dilations is important for scribble-based segmentation.
    • Detailed information about the scribble annotations needs to be provided.
    • The results provided in Table 1 and 2 are confusing.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical innovations of this work are limited and the motivation of method designs is not well explained. Important information on scribble annotations is not provided and the results are confusing.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors have justified their technical contributions in their rebuttal, which are reasonable although not significant. Other unclear details have also been addressed. With the fair technical contributions and promising results, the reviewer raised the score to “weak accept”.



Review #4

  • Please describe the contribution of the paper

    This paper discusses multi-organ segmentation from abdominal CT images using a DL approach. Manual multi-organ segmentation is a clinically time-consuming task. To reduce that burden, AI-based solutions (what we call inference engines) are gaining importance. Technically the paper is exceptionally well written, with recent references cited and discussed in sufficient detail in the literature review. However, the gist needed to convince a non-expert or a radiologist of the technical content is missing. They cannot always go by numbers alone; a little explanation from the domain perspective would also make this paper stronger.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors have proposed a DL-based technique for segmenting multiple anatomical regions from abdominal CT images. Technically the paper is good, with a good balance of explanation across all sections. The proposed methodology can be re-executed, tested, and used for improving the model further. The shared WORD dataset is also accessible to others for reproducibility. The results look promising.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Technically the paper is good, but from the domain perspective it fails to convince medical professionals. Apart from a few imaging parameters, Fig. 2, and a few quantitative validations, there are no details from an image-processing or radiology perspective. Also, only a limited number of good cases are considered in the work; diverse data and cases are not seen. There is no discussion of a subjective evaluation of the results or of clinical testing of the method with doctors. The conclusions rest purely on statistical numbers. I do not prefer to use the word novel.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work can be reproduced, the results can be generated for analysis by the experts.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    1. Only the quantitative validations are discussed. Who validated the end results clinically? We cannot just rely on the numerical values produced by the DL black box. We have tons of DL models developed for medical image segmentation, and all authors claim that their model is the best. But from a practical perspective, radiologists are still often not convinced by this black box. Clinical validation of the results with a subjective assessment is essential, and it is not mentioned anywhere in the paper. Please refer to the paper: https://pubs.rsna.org/doi/full/10.1148/ryai.2020200029

    2. The word “we” is used more than 35 times. Write in the third person. Revisit the English once more.

    3. Page 1, last line: there is nothing called a 3D image in CT; this term is used interchangeably. It is a 3D volume.

    4. Do not use the word novel. Instead, say “a highly optimized” or “highly efficient”.

    5. You have mentioned that the slice thickness of your dataset ranged from 2.5 mm to 3.0 mm. But looking at the results, the boundaries have a smooth transition between slices in the 3D volume in Fig. 2. My point is that with ST = 2.5 mm you will not get such a smooth transition near the boundary. My experience suggests that the volume shown in the last row of Fig. 2 was created from CT images with ST < 0.5 mm. Hence, I would like to know whether you performed any slice interpolation to achieve isotropic voxels during 3D volume reconstruction.

    6. There is no discussion of the diagnostic quality of the images. Were there any artifacts such as partial volume effects, streak artifacts, motion artifacts, or poor tissue contrast in the 150 datasets? How did the model perform on segmentation in these cases?

    7. Does your work extract features from 2D axial slices or from the 3D volume? Features on 2D slices differ from 3D volume features.

    8. The hyperparameters used in the work are listed, but why only these values? Why not others? The rationale behind these specific values compared to alternatives is not discussed, and it is important.

    Section 3.2

    1. Include the corresponding axial CT slices as the first column in Fig. 2; we cannot just look at the GT and the results, we also need the reference of the original slices to ensure that the GT is correct.

    2. Did you not consider large and small intestine segmentation? Why are these two excluded?

    3. As the images in Fig. 2 are too small, like thumbnails, I advise you to place your results in the second column, after the ground-truth column. Otherwise it is difficult to compare after zooming. Currently I am unable to visually compare the GT and the last column.

    4. In your results, in the last row (3D visualization), the pancreas appears a little bigger in the GT and smaller in your results. What is the reason? Is it because the volume is slightly rotated in your results? Similarly, the stomach appears slightly smaller.

    5. How was the ground truth drawn in the WORD database images? List the steps that were followed.

    6. Why is there no attempt to visualize the segmented results through direct volume rendering? That would have shown the anatomical composition in a more appealing way, as surface rendering shows only isosurfaces of constant intensity.

    7. When volume-related measurements exist for validating the segmented volume against the GT, why are only DSC and two distance measures (ASD and HD) calculated? Why not other measures such as a) relative absolute volume difference in %, b) average symmetric absolute surface distance in mm, c) symmetric RMS surface distance in mm, and d) maximum symmetric absolute surface distance in mm? These would help equally in making the validation strategy robust.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would give a weighting of 70 percent for acceptance; the remaining 30 can be scored if the authors address all the comments. Hence I recommend acceptance after major modification. Overall the work is good, with certain weaknesses.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    3

  • [Post rebuttal] Please justify your decision

    The rebuttal is very vague. The comments are not addressed properly. It is difficult to match the justifications to each comment, as all responses are clubbed together.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    This paper presents a weakly supervised approach for multiple organ segmentation using scribble annotations. The method utilizes a triple-branch network with different dilation rates and voxel-wise uncertainty. The results show improvements over existing methods on the WORD dataset. The paper is technically sound and accessible, but there are some weaknesses that need to be addressed. According to the reviews, the rebuttal should: justify the technical contributions; provide detailed information about the scribble annotations and their acquisition process; clarify the meaning of “L_{MAC}” in Eq. 5; include subjective evaluation and clinical validation of the results; discuss potential artifacts and challenges in the dataset; clarify whether features were extracted from 2D axial slices or the 3D volume; provide the rationale for the chosen hyperparameter values; justify the exclusion of large and small intestine segmentation; improve the visual presentation of the results; and describe the steps in creating the ground-truth annotations.




Author Feedback

We warmly thank the reviewers for their positive/constructive comments. They say that our method is “novel” (R1), “promising” (R1&R4), “outperforming other existing methods” (R1&R2), and “technically exceptionally well written” (R4). Here we address the main points in their reviews.

*Novelty against DMPLS [15] (R2): Our TDNet has several important differences from DMPLS: 1) It has 3 asymmetric decoders with different dilation rates to mine complementary features at different scales, while DMPLS has two decoders with a shared structure used with dropout, leading to very similar predictions that are less complementary to each other. 2) TDNet obtains soft pseudo-labels that are more informative and robust than the hard pseudo-labels in DMPLS, which may be over-confident. 3) We introduce uncertainty-weighted losses to learn more effectively from pseudo labels, while DMPLS ignores uncertainty information. 4) We introduce a class affinity-based consistency loss for high-level regularization, whereas DMPLS only considers pixel-level supervision.
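The class affinity-based consistency mentioned in point 4) can be sketched as follows; the affinity construction (class-probability inner products with row normalization) and the symmetric-KL comparison are illustrative assumptions for clarity, not the paper's exact formulation:

```python
import numpy as np

def class_affinity_kl(feat_a, feat_b, eps=1e-8):
    """Compare class-affinity matrices from two decoder branches with a
    symmetric KL divergence. feat_* are (C, N) class-probability maps
    flattened over voxels; the construction is an illustrative sketch."""
    def affinity(feat):
        a = feat @ feat.T                              # (C, C) class co-activation
        return a / (a.sum(axis=1, keepdims=True) + eps)  # row-normalize to distributions
    p, q = affinity(feat_a), affinity(feat_b)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return 0.5 * (kl_pq + kl_qp)
```

Encouraging this divergence to be small across the three decoders regularizes inter-class relationships rather than individual voxel predictions.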

  • Motivations and contributions (R1&R2)
    - Decoders with different dilation rates: this makes the three decoders extract features at different scales, and introducing a consistency regularization between them improves robustness to organs at various scales. They also improve the feature-learning ability of the shared encoder. Table 2 and Fig. 3 show that, compared with using the same dilation rate, our method achieves better performance with less over-segmentation.
    - Uncertainty-weighted pseudo-labels: this highlights reliable pseudo labels and suppresses unreliable ones, avoiding the effect of low-quality pseudo labels. Table 2 and Fig. 3 show that this improved the performance. We will rephrase the corresponding sentences in Section 2.2 to clarify this.
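The uncertainty-weighting idea above can be sketched as follows: average the three decoders' softmax outputs into a soft pseudo label and derive a voxel-wise reliability weight from the ensemble's entropy. The function name and the entropy-based weighting are illustrative assumptions; the paper's exact rectification formula may differ:

```python
import numpy as np

def uncertainty_weighted_pseudo_labels(probs_list, eps=1e-8):
    """probs_list: softmax outputs from the decoders, each (C, D, H, W).
    Returns the soft pseudo label and a voxel-wise weight in which
    confident (low-entropy) voxels are close to 1. Illustrative sketch."""
    mean_prob = np.mean(probs_list, axis=0)                      # ensemble soft label
    entropy = -np.sum(mean_prob * np.log(mean_prob + eps), axis=0)
    num_classes = mean_prob.shape[0]
    weight = 1.0 - entropy / np.log(num_classes)                 # normalize to [0, 1]
    return mean_prob, weight
```

The weight map would then scale the per-voxel pseudo-label loss on each decoder's output, down-weighting voxels where the ensemble is uncertain.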

*Hyperparameters (R4): alpha_t and beta_t are time-dependent hyper-parameters based on a ramp-up function. In the early training stage, their values are relatively small, as the pseudo labels are not accurate enough. As the model is trained for more epochs, the pseudo labels become better, so the weights of the consistency regularization terms are increased. The maximal values of alpha_t and beta_t (alpha and beta) were set to 1.0 based on the best performance on the validation set, which will be clarified in the manuscript. Due to the space limit, we did not show the sensitivity to hyper-parameters, but we will include it in a future journal version.
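A ramp-up schedule of this kind is commonly implemented with a Gaussian-shaped curve (as in Laine & Aila's temporal ensembling); the exact schedule used in the paper is not specified, so the following is an illustrative sketch under that assumption:

```python
import numpy as np

def rampup_weight(epoch, rampup_epochs, max_weight=1.0):
    """Gaussian-shaped ramp-up: near 0 at the start of training, rising
    smoothly to max_weight by rampup_epochs. Illustrative assumption for
    the time-dependent alpha_t / beta_t described in the rebuttal."""
    if epoch >= rampup_epochs:
        return max_weight
    phase = 1.0 - epoch / rampup_epochs
    return max_weight * float(np.exp(-5.0 * phase * phase))
```

With max_weight = 1.0 (the value reported for alpha and beta), the consistency terms contribute little while pseudo labels are still poor and reach full strength once training stabilizes.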

*Dataset and scribbles (R1&R2&R4)

  • Due to the space limit, we only experimented with one public dataset, WORD [17]; more datasets will be used in a future journal version.
    - Slice thickness was 2.5-3.0 mm as described in [17], and the ground truths were obtained by an oncologist with 7 years of experience and verified by a more senior one with over 20 years of experience. Fig. 2 was based on visualization in ITK-SNAP, and the smoothness along the z-axis may be due to interpolation during visualization. We did not use any slice interpolation during training or testing.
    - The scribbles were simulated by a script (as described in [17]) to imitate human drawing. Scribbles for foreground organs and the background were obtained in the axial view in each volume. We will clarify this in Section 3.1.

*Tables, Figures, and writing (R2&R4)
    - We followed the common practice of using the validation set for the ablation study (Table 2) and the testing set for the comparison with SOTA methods (Table 1), so the best results in the two tables differ.
    - Fig. 2 only shows one case due to limited space. There was under-segmentation of some organs here, and we also observed some over-segmentation in other cases. Nevertheless, Fig. 2 shows that our method was much better than the other weakly supervised methods.
    - We are sorry for the typo in Eq. 5, where L_MAC should be L_USPC, defined in Section 2.2. We will correct this, and some other related descriptions and Fig. 2 will be improved according to the reviewers' suggestions.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper exhibits strengths in its innovative architecture design, the use of voxel-wise uncertainty for decoder supervision, and clear implementation description. Although concerns regarding domain perspective explanation and clinical validation were raised, the authors have largely addressed critical issues and demonstrated promising results. To enhance the paper further, I suggest incorporating additional clinical discussions and explanations to validate the method in clinical settings, addressing concerns raised by reviewers. Considering the reviews, rebuttal and the paper, I recommend accepting the paper.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper proposes a weakly supervised approach to multiple organ image segmentation using initializations based on scribbles. Its key novelty is the use of a triple-branch network with one encoder and three decoders to handle multi-resolution segmentation. The authors satisfactorily addressed the reviewers' concerns.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper proposes a weakly supervised method for scribble-based abdominal organ segmentation. The rebuttal addressed several critical concerns raised by reviewers, such as technical novelty, clinical motivation, and dataset definition. The rebuttal does have drawbacks in that not all concerns are well addressed (Reviewer 4 also pointed this out), which may be due to the limited space. However, considering the overall contributions of the paper, I recommend acceptance.


