
Authors

Yuhao Huang, Xin Yang, Xiaoqiong Huang, Jiamin Liang, Xinrui Zhou, Cheng Chen, Haoran Dou, Xindi Hu, Yan Cao, Dong Ni

Abstract

Deep segmentation models often face failure risks when the test image presents unseen distributions. Improving model robustness against these risks is crucial for the large-scale clinical application of deep models. In this study, inspired by the human learning cycle, we propose a novel online reflective learning framework (RefSeg) to improve segmentation robustness. Based on the reflection-on-action concept, RefSeg first drives the deep model to take action to obtain the semantic segmentation. Then, RefSeg triggers the model to reflect on itself. Because making deep models realize their segmentation failures during testing is challenging, RefSeg synthesizes a realistic proxy image from the semantic mask to help deep models build intuitive and effective reflections. This proxy translates and emphasizes the segmentation flaws. By maximizing the structural similarity between the raw input and the proxy, the reflection-on-action loop is closed and segmentation robustness improved. RefSeg runs in the testing phase and is general for segmentation models. Extensive validation on three medical image segmentation tasks, with a public cardiac MR dataset and two in-house large ultrasound datasets, shows that RefSeg remarkably improves model robustness and reports state-of-the-art performance over strong competitors.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16452-1_62

SharedIt: https://rdcu.be/cVVqj

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a test-time adaptation method for segmentation. The authors propose to synthesise an image from the segmentation prediction and a sketch of the input image. By iteratively improving the similarity between the synthesised image and the input image, the segmentation performance improves. To handle the domain shift, the authors propose to use two similarity losses together with GAN losses to train the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and easy to understand. The problem is well motivated.
    2. The method of synthesising and “reflecting” is interesting, novel, and technically sound.
    3. The demo/video included in this paper is a plus.
    4. The experiments and results are extensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Meta-learning DG approaches are not mentioned in the related work, at least the ones from previous MICCAI editions [1, 2]. I also suggest the authors cite the TTA papers from previous MICCAI editions. There are also some relevant papers in the last DART workshop.
    2. It is unclear how the segmentor enables differentiable learning, and what the label intensity values are. The multiplication and addition operations are not clear. I suggest the authors clarify this.
    3. If the normalised heatmap containing failures is used as the attention matrix, will it not “emphasise” the wrongly classified areas (such as the RV area in Fig. 2)? The purpose is to penalise the wrongly classified areas, but with this attention the loss may not work as expected. An ablation should be conducted and a discussion of this included.
    4. The results of the proposed method are very similar to those of some strong baselines. Statistical significance analysis needs to be conducted; the authors checked the corresponding box in the reproducibility response.
    5. In fact, I don’t think the inference time is a selling point of a TTA method. Typically, TTA methods are model-agnostic. Although nnUNet-based models are slower in terms of inference time, TTA + nnUNet models would very likely achieve better results, which the authors have not yet explored.
    6. For TTA, one problem is overfitting as the number of iterations grows. It is unclear whether this happens for this method. The number of iterations in this paper is predefined as 10. It would be good to see more experiments on this hyperparameter.
    7. The overall objective is missing.
    8. The approximated MI is not clear. Self-contained description needs to be added for the readers to understand that loss.

    References: [1] Liu, X., Thermos, S., O’Neil, A. and Tsaftaris, S.A., 2021, September. Semi-supervised meta-learning with disentanglement for domain-generalised medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 307-317). Springer, Cham. [2] Liu, Q., Dou, Q. and Heng, P.A., 2020, October. Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 475-485). Springer, Cham.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    I have a concern about the statistical significance analysis. I suggest the authors address it.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses. I have included the suggestions there.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is an interesting paper, though with some issues to be addressed. I strongly suggest the authors handle these problems. Given the good writing, well-motivated problem, interesting and novel method, extensive experiments and results, and the demo, I recommend accepting this paper.

  • Number of papers in your stack

    7

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a segmentation framework with test-time adaptation to adjust the model for each test image (before inference), which can be useful if the test image is from a slightly different domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper addresses an important topic in medical imaging: “on the fly” model adaptation to a new test domain. The approach appears to be novel, and most design choices are sound. The method is evaluated on 3 datasets and achieves the best results. The method is compared to relevant competitors as well as to general state-of-the-art segmentation models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Out of the 3 datasets used for evaluation, 2 are in-house datasets (which makes it difficult to validate or reproduce the results). On the public M&Ms dataset the comparison against nnUNet is not conducted, but it is included on the private data (why not include it on everything?). Several steps of the method are not well explained, which makes it difficult to reproduce.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Overall the method is well explained, but several components are not clear. Please see below.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    a) What is the exact edge detection algorithm used to produce the sketch? Cite the paper or the implementation of the method, or explain it better.
    b) In Fig. 2: what is the “Label values” block? Is it a ground-truth label or just an integer index of the class?
    c) How do you create the “Heatmap”? Please explain better. Is the heatmap a single-channel image? It seems you create it by multiplying each probability channel by the corresponding index, i.e., 0 * p(0) + 1 * p(1) + 2 * p(2) + …. Is that correct?
    d) What is the reason for combining (adding) the sketch and the heatmap into a single-channel image? Alternatively, you could have concatenated them into a 2-channel input to the Synthesizer.
    e) What is the “normalized Heatmap H_att” on page 5, and how is it produced? Where is it in Fig. 2?
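Point c) conjectures a concrete construction for the heatmap. A minimal sketch of that conjectured weighted-sum construction is given below; the function name and toy values are illustrative, not from the paper:

```python
import numpy as np

def class_heatmap(probs, label_values):
    """Collapse per-class softmax probabilities (C, H, W) into a single-channel
    heatmap by weighting each channel with its label value:
    sum_c label_values[c] * probs[c]."""
    return np.tensordot(np.asarray(label_values, dtype=float), probs, axes=1)

# Toy 2-class softmax output on a 2x2 image (channels sum to 1 per pixel).
probs = np.stack([
    np.array([[0.9, 0.2], [0.5, 0.1]]),  # p(class 0)
    np.array([[0.1, 0.8], [0.5, 0.9]]),  # p(class 1)
])
heat = class_heatmap(probs, [0, 1])      # with labels (0, 1) this equals p(class 1)
```

If this is indeed the construction, the heatmap is single-channel by design, which would also answer the first part of the question.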

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is missing details on some of the method components. Nevertheless the architecture choices are sound, the application is important, and the paper is easy to follow.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel approach for test-time adaptation, employing a cGAN-based image synthesis network to guide the finetuning of the segmentation network. Specifically, at test time, the segmentor is finetuned to minimize a structure similarity loss computed between the input image and the image synthesised by the cGAN (conditioned on the network output and an auxiliary Canny edge map).
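The adaptation loop described above can be sketched in miniature. The following toy sketch uses hypothetical scalar stand-ins for the segmentor and the cGAN synthesizer, and plain MSE in place of the paper's structure similarity loss; it only illustrates the closed-loop structure of segment → synthesize proxy → compare to input → update:

```python
import numpy as np

# Hypothetical stand-ins for the real networks (the paper uses a deep
# segmentor and a cGAN synthesizer).
def segment(image, theta):
    return theta * image                 # "segmentation" with parameter theta

def synthesize(mask):
    return 2.0 * mask                    # "proxy image" rendered from the mask

def refseg_step(image, theta, lr=0.01):
    """One reflect-on-action step: render a proxy from the current
    segmentation, then nudge theta so the proxy better matches the input."""
    proxy = synthesize(segment(image, theta))
    # Gradient of mean((proxy - image)^2) w.r.t. theta for these stand-ins.
    grad = 2.0 * np.mean((proxy - image) * 2.0 * image)
    return theta - lr * grad

image = np.array([1.0, 2.0, 3.0])
theta = 0.1
for _ in range(10):                      # fixed iteration budget, as in the paper
    theta = refseg_step(image, theta)
```

For these stand-ins the proxy matches the input exactly at theta = 0.5, and the loop converges toward that value within the 10-step budget.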

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed structure similarity loss, a combination of a mutual information loss and a normalised cross-correlation loss weighted by the predicted heatmap, is sophisticated yet effective.
    • Extensive experiments were performed on three cross-vendor datasets. Results demonstrate the efficacy of the method.
    • An ablation study has also been conducted to verify the contribution of each component.
    • The online demo provided is impressive.
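For readers unfamiliar with the normalised cross-correlation term mentioned in the first strength, a common zero-normalised form is sketched below. This is one plausible formulation, not necessarily the paper's exact loss, and the heatmap weighting is omitted:

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Zero-normalised cross-correlation: standardise both images and
    average their elementwise product. Ranges from -1 to 1."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float(np.mean(a * b))

x = np.array([[1.0, 2.0], [3.0, 4.0]])
```

Here ncc(x, x) is 1 and ncc(x, -x) is -1; as a loss term one would typically minimise 1 - ncc.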
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Missing reference. The idea of using a synthesized image for failure detection has been explored in [1], which should be cited and discussed in the related work.
    2. Methodology design. For the test-time adaptation, why does the synthesizer need to be finetuned together with the segmentor? I am concerned about the instability of the optimization with such a dynamic image synthesis network at test time. Have the authors tried optimizing the segmentation network only?
    3. Methodology design. Is there any stopping criterion for the test-time adaptation? From Fig. 4, it seems that the performance can drop after a number of iterations.
    4. Unstable performance observed on test cases. The Dice curves over the test set in Fig. 4 show that there are still quite a few cases where RefSeg fails to refine the segmentation (especially on the M&Ms dataset), yielding even lower Dice scores during test-time adaptation. Are these failure cases all from the unseen test domain, i.e., due to the domain shift? A discussion of those failure cases should be added.
    5. The definition of H_att is not clear to me. It would be better to give a mathematical definition. It seems that H_att is the network prediction after softmax multiplied by the label values. Why not simply feed the multi-channel output to guide the image synthesis? Please clarify.

    [1] Li, K., Yu, L. and Heng, P.-A. (2022) ‘Towards reliable cardiac image segmentation: Assessing image-level and pixel-level segmentation quality via self-reflective references’, Medical image analysis, 78, p. 102426. doi:10.1016/j.media.2022.102426.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Reproducible if the code is released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Fig. 1: the construction of the input to the image synthesis network is not clear to me. Is it a multi-channel input, or a single-channel input that combines the heatmap with the edge map?
    • Page 4, “in default” → “by default”
    • Page 5, 3): please explain what I is. Is it the identity matrix?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is novel, though I have concerns regarding the design of the methodology, e.g., the instability of the optimization at test time, the missing stop criterion to prevent overfitting. As a result, quite a few failure cases can be found in the results.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The work has methodological advancements, the evaluation is sound and comprehensive, and the paper organization is very good. The authors are encouraged to address the remaining questions on the method description raised by the reviewers. Furthermore, failure cases should be discussed.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR




Author Feedback

N/A


