
Authors

Yufan He, Dong Yang, Andriy Myronenko, Daguang Xu

Abstract

The training hyperparameters (learning rate, augmentation policies, etc.) are key factors affecting the performance of deep networks for medical image segmentation. Manual or automatic hyperparameter optimization (HPO) is used to improve the performance. However, manual tuning is infeasible for a large number of parameters, and existing automatic HPO methods like Bayesian optimization are extremely time-consuming. Moreover, they can only find a fixed set of hyperparameters. Population based training (PBT) has shown its ability to find dynamic hyperparameters and achieves fast search speed by using parallel training processes. However, it is still expensive for large 3D medical image datasets with limited GPUs, and its performance lower bound is unknown. In this paper, we focus on improving network performance using hyperparameter scheduling via PBT with limited computation cost. The core idea is to train the network with a default setting from prior knowledge, and then fine-tune it using PBT-based hyperparameter scheduling. Our method achieves 1%-3% performance improvements over the default setting while taking only 3%-10% of the computation cost of training from scratch using PBT.
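
The following is a minimal, self-contained sketch (not the authors' code) of the idea described in the abstract: every worker starts from the same checkpoint trained with default hyperparameters, then short fine-tuning rounds alternate with exploit/explore steps. All names (`evaluate_round`, `explore`, the toy objective) are hypothetical; the paper proposes new hyperparameters with a TPE sampler, whereas this sketch uses simple random perturbation for brevity.

```python
# PBT-style fine-tuning sketch: start all workers from one pretrained checkpoint,
# then exploit (copy best weights) and explore (perturb hyperparameters).
import copy
import random

DEFAULT_HPS = {"lr": 1e-3, "aug_prob": 0.5}  # "default setting from prior knowledge"
N_WORKERS = 3                                # small population, per the low-cost regime
N_ROUNDS = 5

def evaluate_round(weights, hps):
    """Stand-in for one short fine-tuning round returning a validation score.
    A real implementation would resume training from `weights` with `hps`."""
    # Toy objective with an optimum near lr=3e-3, aug_prob=0.7.
    return -((hps["lr"] - 3e-3) * 1e3) ** 2 - (hps["aug_prob"] - 0.7) ** 2 + random.gauss(0, 0.01)

def explore(hps):
    """Perturb hyperparameters (the paper would sample them with TPE instead)."""
    return {"lr": hps["lr"] * random.choice([0.8, 1.2]),
            "aug_prob": min(1.0, max(0.0, hps["aug_prob"] + random.uniform(-0.1, 0.1)))}

# Every worker starts from the *same* pretrained checkpoint (a dummy dict here).
pretrained = {"weights": [0.0]}
population = [{"weights": copy.deepcopy(pretrained), "hps": dict(DEFAULT_HPS), "score": None}
              for _ in range(N_WORKERS)]

for rnd in range(N_ROUNDS):
    for w in population:
        w["score"] = evaluate_round(w["weights"], w["hps"])
    population.sort(key=lambda w: w["score"], reverse=True)
    best, worst = population[0], population[-1]
    # Exploit: the worst worker copies the best checkpoint; explore: perturb its HPs.
    worst["weights"] = copy.deepcopy(best["weights"])
    worst["hps"] = explore(best["hps"])
    print(f"round {rnd}: best score {best['score']:.4f} with hps {best['hps']}")
```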

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16443-9_54

SharedIt: https://rdcu.be/cVRy8

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    The authors propose a hyperparameter optimization method that performs a local search: it takes the best-performing checkpoints from an initial training run and then, in parallel, retrains the model from each checkpoint using multiple hyperparameter sets sampled with Tree-structured Parzen Estimators (TPE). The method does not require expensive retraining from scratch every time a different hyperparameter setting is explored. It was tested on several datasets from the Medical Segmentation Decathlon (MSD) challenge.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • the approach of using multiple checkpoints and retraining models to populate the configuration space for TPE appears to be a novel application of hyperparameter optimization
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the authors have not convincingly explained why the results are worse than the default setting when the number of workers is small (W=9)
    • the authors assume that a good set of hyperparameters exists for a given model/problem, which may not be the case. If the default set of hyperparameters is not good enough, I am not sure whether this method would work well. If the authors can show that this is not the case, the method would be more convincing. For example, given a task and a model, if the authors used a random set of hyperparameters to initially train the model and then applied their hyperparameter optimization technique, would the method still yield a good enough solution compared to when a good set of initial hyperparameters was used for the first training run?
    • As such, the method seems useful for finding incremental improvements to a model rather than for finding a set of optimal hyperparameters, as it performs a more localized search than a global search in the hyperparameter space
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code or implementation has been shared. Public dataset used. Reproducibility status unknown.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • it seems that, among the hyperparameters used in Table 1, none are related to the architecture of the models, for example the activation functions, number of layers, or number of filters. Could you explain why these specific hyperparameters were chosen?
    • explain why, when the number of workers is low, you do not obtain better results than the default setting.
    • could you provide test results on a held-out test set for the optimized models vs. the default model? It seems you have only reported validation results, not test results. I am interested to see how much of a test performance gain would be observed if the model were retrained from scratch using the "optimized" set of hyperparameters found with this technique
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    the utility of this method seems limited to cases where models with good initial hyperparameters are already known. We cannot assume this is the case for all problems.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    This paper presents a hyperparameter optimization technique, based on population based training (PBT), that reduces the training cost of the original PBT to make it feasible for large 3D medical images. The key idea is to start from a set of default hyperparameters (chosen based on prior knowledge) instead of random ones, which reduces the number of workers needed for PBT to converge. The method is evaluated on 4 tasks of the Medical Segmentation Decathlon and on 2 network architectures, and shows a slight improvement in Dice score (1%-3%).

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • the claimed reduction of the computation cost to 3%-10% of the original PBT is impressive
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • the main novelty of this method is to start PBT from default hyperparameters, which already achieve good performance. This restricts the use of the method to settings where ‘good’ hyperparameters are already known and can be used as a starting point.
    • the authors acknowledge that the performance improvement from their method can be quite small; it might be mainly useful for tuning hyperparameters to compete in challenges
    • a performance comparison between the proposed method, the original PBT, and default training is missing
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors plan to publish their code. The data used is public.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • since the search space of the hyperparameters is restricted to a range around the default values, I wonder how this method compares to a simple grid search around the default values. This could be discussed in the paper, e.g. by comparing against grid search.
    • the authors mention an average performance increase of 1%-3%, but the actual performance on the 4 datasets is not reported in the paper. Figure 3 gives some hints of the performance on the validation set, but it would be helpful for the reader to see the performance in tabular form. A comparison with the original PBT in tabular form would also be desirable.
    • Figure 3 is mentioned before Figure 2 in the text
    • on page 3 the authors claim that their method reduces the training cost to 3%-10% of original PBT. Could the authors elaborate how they estimated this amount of reduction?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a modified PBT with a restricted hyperparameter search space that starts from a set of default parameters. This reduces the computational overhead of PBT, making it feasible for 3D medical segmentation networks, but it comes with the drawback of a restricted search space. The performance improvement over the default values is small, and the authors do not compare with simpler optimization approaches such as grid search.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The performance improvement over the default values is small and the method is only applicable when ‘good’ hyperparameters are already known and can be used as a starting point.



Review #5

  • Please describe the contribution of the paper

    This paper utilizes population based training (PBT) for hyperparameter tuning of medical image segmentation models. Starting from a default setting, the tuning achieves performance improvements while saving 90%-97% of the computation cost of training from scratch with the original PBT.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper uses an existing technique to solve a practical hyperparameter tuning problem. There is potential value in applying it to many other applications and in accelerating the development of new deep learning models in the future.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses of the paper are the limited validation and novelty. Although incorporating PBT into medical imaging with prior knowledge is a good idea and yields consistent improvements, the value of the method is mainly practical and it lacks a methodological contribution. That makes the current version of the paper less interesting. However, it could become a good paper if more solid validation were provided, e.g. showing that it improves nnU-Net on more medical imaging challenges.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors intend to make all code publicly available, which ensures good reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    • Why is there no PBT best worker trained from scratch (W=27) for UNet on the Lung task in Figure 2?
    • Add significance tests to the results, and repeat the experiments to account for the randomness of the hyperparameter search.
    • Evaluate the method on more challenges.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a good hyperparameter tuning strategy based on PBT. However, the current version needs more solid validation to prove its effectiveness.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    3

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The rebuttal makes the paper more convincing. However, as a hyperparameter tuning method, more solid validation on real test data across different challenges is still needed.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper presents what could be an interesting idea with some potential for future development, but there is no strong conviction among the reviewers about the applicability and the restrictions of the method. There are also some concerns about the explanation for the limited improvement observed on validation and the lack of statistical testing. These aspects should be addressed thoroughly in the rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    9




Author Feedback

We thank the reviewers for their comments. There are few works using HPO for segmenting large medical datasets because the computation cost is unaffordable. As shown on Page 7, training one trial on Task 1, 6, or 7 takes 9 hours on 8 V100 GPUs, so a 27-worker PBT costs 10 days (9 x 27 hours). That is why we do not have PBT (W=27) from scratch for the Task 6 lung tumor (answering R#5; the reason is mentioned on Page 6). For grid/random search, many more trials are needed to cover the 14-dimensional hyperparameter (HP) space (Table 1), which can take months. Without enough GPU hours, they can produce much worse results than expert-designed HPs (R#2's suggestion). PBT is a greedy method, and small worker numbers (10 or below) tend to have higher variance and suffer from poorer results (original PBT paper, Ref[10]; answers R#2). However, we do not know what worker number is sufficient for a new dataset, and the convergence lower bound is unknown (Ref[23]). Fig. 2 and the Appendix show that W=9 PBT from scratch can have worse results than default training on all tasks, even though we spent 3 days on 8 V100 GPUs for each of Tasks 1, 6, and 7. This is a significant problem, since we may need to spend additional GPU hours just to tune the parameters of PBT itself (including the worker number).
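
A quick back-of-the-envelope check of the cost figures quoted above, assuming only the per-trial time and worker count stated in this rebuttal (9 hours per trial on 8 V100 GPUs, 27 workers) and the claimed 3%-10% cost fraction; this is purely illustrative arithmetic, not the authors' accounting:

```python
# Verify the quoted cost figures: 27-worker PBT from scratch vs. the claimed
# 3%-10% budget of the proposed fine-tuning approach.
hours_per_trial = 9   # one full training trial on 8 V100 GPUs (rebuttal)
workers = 27          # PBT population size for the from-scratch baseline

pbt_from_scratch_hours = hours_per_trial * workers  # 243 hours, roughly 10 days
print(f"PBT from scratch: {pbt_from_scratch_hours} h (~{pbt_from_scratch_hours / 24:.1f} days)")

for frac in (0.03, 0.10):
    print(f"proposed method at {frac:.0%}: ~{pbt_from_scratch_hours * frac:.0f} h")
```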

R#5 thinks the work lacks novelty. To reduce HPO cost, existing methods try to improve sample efficiency (Bayesian models) or evaluate HPs with fewer epochs (multi-fidelity). We explained why those cannot be applied to PBT and why starting from pretrained checkpoints makes sense. As the introduction discusses, our method rethinks HPO/PBT from a new perspective, with a clear rationale and intuition. None of the existing works provide the efficiency and lower-bound guarantee that ours does, and this makes HPO for performance improvement applicable in real practice, which is indeed novel.

R#4 criticizes "starting from and restricted by default HPs". This is not correct: PBT starts from pre-trained model weights, and the range of HPs in Table 1 is not restricted to a neighborhood of the defaults. We acknowledge the limitation of using "default HPs" pointed out by R#2 and R#4. The default HPs in our case serve as "priors", which are always required for finding shortcuts in optimization ("no free lunch"). Researchers add "priors", e.g. topology/shape, to tasks with limited training data, and this should not be criticized. It is a trade-off between resources and algorithms: if we had unlimited GPUs, we could try unlimited HPs without using any prior. As stated on Page 3, many HPs can be transferred or decided automatically in medical imaging. DiNTS (Ref[5]) uses the same HPs across all ten tasks, and nnU-Net generates HPs automatically; both achieve top results on the leaderboard. So we do not think these "default HPs" are a significant obstacle to applying our method.

All reviewers have concerns about test results/limited validation. For AutoML (HPO or neural architecture search, NAS), the purpose of training is to optimize architectures/HPs for better validation accuracy, and it is common practice to use the validation/loss curve as evidence of "better HPs", as in HPO papers (Ref[4,10,15]). Experiments on 4 tasks (3 large, challenging tumor datasets in multi-modal MRI and CT) and two networks clearly show the efficacy of the method in improving validation accuracy, so we do not agree with the "limited validation" assessment from R#5.

We agree with the comments about the lack of test results. MSD test labels are not provided, and the submission system was closed unexpectedly; we have contacted the organizers about reopening it. Even so, these 8 experiments demonstrate the applicability of our method.

Regarding R#4's other comments about comparisons with the original PBT and default training, these are shown in Fig. 2 and the Appendix. The actual Dice performances are listed in Fig. 3 to save space. The cost reduction is based on total epochs and is presented on Page 7. R#2 asks about the selection of HPs (Table 1). We focus on HPO, not NAS, so we include almost all HPs except those related to the architecture (PBT cannot handle changing architectures).




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors made an appropriate rebuttal to the concerns regarding the applicability of the method. This pushes the paper towards acceptance as a conference paper presenting a new and interesting idea, although more work is needed for adequate validation.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    7



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper introduces an automatic hyperparameter optimization method. Although the method seems interesting, the reviewers raised concerns about its applicability and insufficient validation. The authors addressed these issues mostly well in the rebuttal. One reviewer changed their rating from 4 to 5 after the rebuttal, so the overall rating leans towards acceptance, which I agree with.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    This paper utilizes Population based training (PBT) for hyperparameter tuning for medical image segmentation models.

    The results are not entirely convincing, and 2 out of 3 reviewers maintain their rejection scores after the rebuttal. As one reviewer states: "The rebuttal makes the paper more convincing. However, as a hyperparameter tuning method, more solid validation on real test data across different challenges is still needed."

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Reject

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    15/20


