
Authors

Stefano B. Blumberg, Hongxiang Lin, Francesco Grussu, Yukun Zhou, Matteo Figini, Daniel C. Alexander

Abstract

We present PROSUB: PROgressive SUBsampling, a deep-learning-based, automated methodology that subsamples an oversampled data set (e.g. channels of multi-channeled 3D images) with minimal loss of information. We build upon a state-of-the-art dual-network approach that won the MICCAI MUlti-DIffusion (MUDI) quantitative MRI (qMRI) measurement sampling-reconstruction challenge, but which suffers from deep-learning training instability because it subsamples with a hard decision boundary. PROSUB uses the paradigm of recursive feature elimination (RFE) and progressively subsamples measurements during deep-learning training, improving optimization stability. PROSUB also integrates a neural architecture search (NAS) paradigm, allowing the network architecture hyperparameters to respond to the subsampling process. We show PROSUB outperforms the winner of the MUDI MICCAI challenge, producing large improvements of >18% MSE on the MUDI challenge sub-tasks and qualitative improvements on downstream processes useful for clinical applications. We also show the benefits of incorporating NAS and analyze the effect of PROSUB's components. As our method generalizes beyond MRI measurement selection-reconstruction to problems that subsample and reconstruct multi-channeled data, our code is available at https://github.com/sbb-gh/PROSUB
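
As a rough illustration of the select-and-reconstruct setup described above (a hypothetical sketch only, with illustrative layer sizes and names, not the released PROSUB code): one MLP scores the measurement channels, a progressively tightened mask suppresses the dropped ones, and a second MLP reconstructs the full signal.

```python
# Hypothetical sketch of a dual-MLP subsample-and-reconstruct setup
# (illustrative sizes/names only, not the released PROSUB code).
import torch
import torch.nn as nn

N, M = 1344, 250  # total measurements and example target subset size

selector = nn.Sequential(       # scores each of the N input channels
    nn.Linear(N, 512), nn.ReLU(), nn.Linear(512, N))
reconstructor = nn.Sequential(  # maps the masked subset back to all N channels
    nn.Linear(N, 512), nn.ReLU(), nn.Linear(512, N))

def forward(x, mask):
    """x: (batch, N) signals; mask: (N,) values in [0, 1] driven towards 0/1 during training."""
    scores = selector(x)             # per-channel importance scores (used to decide what to drop)
    recon = reconstructor(x * mask)  # soft subsampling instead of a hard cut
    return recon, scores

x = torch.randn(8, N)
recon, scores = forward(x, torch.ones(N))  # nothing removed yet
loss = nn.functional.mse_loss(recon, x)    # reconstruct all N channels
```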

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_40

SharedIt: https://rdcu.be/cVRTz

Link to the code repository

https://github.com/sbb-gh/PROSUB

Link to the dataset(s)

https://www.developingbrain.co.uk/data


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses the joint problem of sampling scheme optimization and signal reconstruction/q-space super-resolution posed in the MUDI 2019 challenge. The authors present a method improving upon, and comparing with, the challenge winner (SARDU-Net). Similar to SARDU-Net, they use two MLPs (one for subsampling and one for super-resolution) and introduce two changes: 1. an improved iterative method to build the subsampling mask, leaning on RFE, and 2. a hyperparameter optimization scheme which (presumably) at its core increases the network capacity (i.e. the number of parameters). In a quantitative signal evaluation (MSE) the method achieves better results than the baseline. Additionally, several qualitative downstream analyses reaffirm the improved reconstruction quality.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper recognizes that signal reconstruction is only a means to obtain downstream results and provides qualitative downstream results. The paper adopts the task framing from the 2019 MUDI challenge and compares against the winning baseline from that challenge. This work presents results clearly superior to the baselines; however, the reason for the improvements is unclear (see weaknesses).

    The RFE idea seems promising, but it is unclear at this point. (I would assume this is connected to Alg. line 8, $m_t^e = \max\{m_t - (e - E_d)\,\mathbb{I}_{e \ge E_d}\cdot\mathbb{I}_{i\in D}(i),\ 0\}$, see Table 2; judging from Table 2, this would be the critical bit, but it remains a guess.)
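
If the quoted update is read as above (my reconstruction of the garbled formula, so both the reading and the sketch below are hypothetical), it amounts to a linear per-epoch decay of the mask entries scheduled for removal, delayed by E_d epochs and clamped at zero:

```python
# Hypothetical translation of the (reconstructed) mask update quoted above.
import numpy as np

def mask_update(m_t, e, E_d, D):
    """m_t: (N,) current mask; e: epoch index; E_d: delay in epochs; D: indices being removed."""
    in_D = np.zeros_like(m_t)
    in_D[list(D)] = 1.0
    decay = max(e - E_d, 0) * in_D       # (e - E_d) * 1[e >= E_d] * 1[i in D]
    return np.maximum(m_t - decay, 0.0)  # clamp at zero
```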

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The merits of this paper are difficult to analyze/understand amid the changing network capacity, the interacting (because jointly optimized) mask and hyperparameters, the writing style, and the evaluation metrics. Despite the additional ablation table in the Suppl. Mat. (which is unclear), it is difficult to attribute the performance gains.

    Misleading performance numbers and unclear “reason” for performance. The difference in MSE between the SARDU-Net paper [15] and here is unclear ([15] reports 5-8x better MSE in Table 2, making a better description of the evaluation mandatory). “altering the second network’s input across different batches, producing instability” indicates a problem in the setup of SARDU-Net (randomly changing network inputs are incompatible with an MLP). Across all tested architectures, the authors employ MLP architectures. MLPs scale extremely well for diffusion super-resolution, so increasing the number of hidden layers and/or features (also units) is always beneficial (even beyond the depth of 4 used here). While obfuscated and ignored in the paper, the performance increase from more layers/features (= more parameters = network capacity) is, in my eyes, trivial. To clarify this, the authors should 1. report the network size (parameters, maybe even FLOPs), and 2. normalize different architectures to one network size (in contrast, finding the “sweet spot” for the ratio of hidden layers to features/units as an ablation/hyperparameter would be useful). As is, I expect the “largest” architecture (which coincidentally is larger than the baselines) to achieve the best performance (this seems to explain the improvements for M <= 50).
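
For reference, reporting network size as requested here is straightforward; a minimal sketch (with an arbitrary example MLP, not one of the paper's architectures):

```python
# Counting trainable parameters of an MLP (example architecture only).
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

mlp = nn.Sequential(nn.Linear(1344, 512), nn.ReLU(),
                    nn.Linear(512, 512), nn.ReLU(),
                    nn.Linear(512, 1344))
print(f"{count_parameters(mlp):,} trainable parameters")
```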

    Performance of SARDU-Net-NAS. Since the SARDU-Net hyperparameters lie within the SARDU-Net-NAS hyperparameter search space, the fact that SARDU-Net-NAS does not consistently achieve the performance of SARDU-Net implies the hyperparameter search used is not stable.

    Quantitative downstream performance metrics. Signal MSE can be misleading in light of the signal-inherent noise, so quantitative downstream analysis is required to verify the validity of predictions, e.g. FA, NODDI, fODF, etc.

    The results (Table 1) seem to be an MSE across N=1344, which implies all 1344 DW signals are predicted despite M = {500, …, 10} of them being in the input. For these M signals the optimal network behaviour is to return the “noisy” input value. This implies 1. performance numbers might be biased (the difference between M=500 and M=100 might be an overfit to those M noisy input signals) and 2. it is optimal to use high-noise DWIs as input to reduce the impact of the measurement noise (e.g. use high b-value measurements, which typically are low SNR). As a consequence, I believe the direct comparison in Table 1 is invalid (this effect becomes more dominant with increasing M and maybe explains the performance for M>50). [see also details]
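
The bias described here could be checked by scoring only the channels that were not part of the subsampled input; a minimal sketch (hypothetical names, not the challenge evaluation code):

```python
# MSE restricted to the held-out (non-input) channels.
import numpy as np

def mse_excluding_inputs(pred, target, input_idx):
    """pred, target: (n_voxels, N) arrays; input_idx: indices of the M measured channels."""
    held_out = np.setdiff1d(np.arange(target.shape[1]), input_idx)
    diff = pred[:, held_out] - target[:, held_out]
    return float(np.mean(diff ** 2))
```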

    Is this paper employing NAS? This is incorrect terminology in my opinion. [see details]

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Run times. Run times are provided to some degree in a text file in the code, but I would expect a different format for the reporting: there, run times are reported as the time for a single cross-validation split, but with “NAS” (or hyperparameter optimization) employed, the TOTAL GPU hours/days should instead be reported, because that is what is required to achieve those results. I am fine with estimates, but this is insufficient.

    “MRI signal prediction MSE” is missing a definition or a reference to a definition that is very clear on what is compared.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    “network architecture hyperparameters e.g. number of layers and hidden units, is a task-dependent problem” – NO. As long as your dataset is large enough, adding more layers etc. ALWAYS increases performance. This is not task-dependent. The balance between hidden layers and number of features is task-dependent, though.

    NAS terminology. This paper “only” performs a (smart) hyperparameter optimization (which also falls into the broad category of AutoML, which notably is also the terminology used by AutoKeras [20]). Both the supplementary materials and the code imply that the following critical NAS characteristics are missing: changes to the topology (e.g. skip connections) and different choices of layers (see also NASNet and DARTS; such foundational papers are completely missing from the related work). This issue can easily be fixed by replacing neural architecture search (NAS) with hyperparameter optimization (HO) or AutoML.

    Evaluation. The framing of the challenge makes a clean evaluation difficult, as some biases are ingrained in the challenge framing. As is, I would strongly recommend manually excluding a subset of signals (for some direction + b-value) from the selection (1st) network, but including them in the second network as outputs only, and reporting results only on these (“test signals”); otherwise your network is encouraged to recreate measurement noise. I understand this will be out of scope for a rebuttal. At a minimum, we would need to understand to what extent the noise characteristics of the different signals differ, i.e. I believe the direct comparison in Table 1 is invalid. Example SOLUTION: a new table which reports the performance grouped by b-value (as a proxy for SNR) and by whether the ground-truth signal was in the input or not. Table 1 does not have to list results for 8 different values of M (I’d be happy with 4 to make space for a second table; also, M >= 100 is in my opinion irrelevant for the super-resolution aspect).
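
A minimal sketch of the grouped table suggested here, splitting MSE by b-value shell and by whether the ground-truth signal was among the network's inputs (all names and array shapes are assumptions):

```python
# MSE grouped by b-value and by input/held-out status.
import numpy as np

def mse_by_bvalue(pred, target, bvals, input_idx):
    """pred, target: (n_voxels, N); bvals: (N,) b-value per channel; input_idx: measured channels."""
    is_input = np.zeros(target.shape[1], dtype=bool)
    is_input[input_idx] = True
    rows = []
    for b in np.unique(bvals):
        for in_input in (True, False):
            sel = (bvals == b) & (is_input == in_input)
            if sel.any():
                mse = float(np.mean((pred[:, sel] - target[:, sel]) ** 2))
                rows.append((float(b), "input" if in_input else "held-out", mse))
    return rows
```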

    Generally, the writing is often unclear/hard to follow/imprecise. To just understand the text, most of the paper has to be reread multiple times. To illustrate: “PROSUB has an outer loop: steps t = 1, …, T where we simultaneously perform NAS and RFE, choosing the measurements to remove via a score, averaged across the steps, whilst simultaneously updating the network architecture hyperparameters.”

    • Long (check), convoluted (check), too many verbs (6 in one sentence), with parts that are not even a proper sentence (“steps… where”)

    • “PROSUB is not limited for subsampling MRI data sets” -> “PROSUB is not limited to subsampling MRI data sets”
    • “We determine this by (i) by constructing” -> “We achieve this by (i) constructing”
    • “Recursively over steps t = 1, …, T RFE prunes the” -> “Recursively over steps t = 1, …, T, RFE prunes the”

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    2

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Misleading/unclear performance claims

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    4

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    I generally like the rebuttal. Unfortunately, the comments on “improvement … consequence of increased network capacity” are somewhat unsatisfactory to me: extensive testing in our lab (on our code and our dataset) has shown that the task of diffusion reconstruction scales well with network size, and we could not detect any overfitting issues. I find the answer of the authors a bit ambiguous as well (i.e. it is not clear to me if they deny that the bigger network is the reason for the improved performance). In the rebuttal, I asked for the numbers of the network sizes, which I did not get. I updated my rating on the assumption that the authors 1. quantify core network sizes in Table 1 (i.e. number of layers and feature size AND/OR number of parameters) and 2. also provide experimental numbers for PROSUB at maximum network size (from the search space, see Table 3 in the Supplementary Materials).

    Also, I would maintain my evaluation that MSE is sometimes misleading. If accepted, I would appreciate if additional numbers provided in the rebuttal found their way into the final version.



Review #2

  • Please describe the contribution of the paper

    This work introduces an improved neural-network-based method for solving the MUDI 2019 challenge task of recovering a large set of volumes, covering different parameter combinations of diffusion MRI measurements, from a smaller set of volumes. Two neural networks are used in this method, where one is tasked with choosing the smaller set of measurements and the other with reconstructing the large set from the chosen set. The improvements over previous algorithms are in the training scheme, especially for the network choosing the smaller set. Here, an exponential moving average is used to gradually select the smaller set, and a learning schedule is used to gradually decrease the cardinality of the set. In addition, the training is embedded in a neural architecture search component. The method is evaluated on the MUDI 2019 challenge data against versions of the winner of this challenge.
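
A minimal sketch of the exponential-moving-average selection idea described above (the decay constant, score source, and subset size are assumptions, not the paper's values):

```python
# EMA of per-measurement importance scores for subset selection.
import numpy as np

def update_scores(ema_scores, new_scores, beta=0.9):
    """Blend the current step's per-measurement scores into a running EMA."""
    return beta * ema_scores + (1.0 - beta) * new_scores

ema = np.zeros(1344)
for step_scores in np.random.rand(5, 1344):  # stand-in for scores from the selector network
    ema = update_scores(ema, step_scores)
keep = np.argsort(ema)[-250:]                # keep the highest-scoring measurements
```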

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The method is well described and the paper well written. The description of the method is very precise in algorithm 1.

    The building blocks for improvement over previous methods like the exponential moving average for sample selection and the neural architecture search are well motivated.

    I appreciate the effort making an anonymous version of the code available during review and the long-term usable links to the relevant websites.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I am unsure how relevant this task is for clinical applications. I do understand that being able to reconstruct all those measurements from a subset allows for a fast acquisition. But what is the value of having access to all those parameter variations? From my point of view, if one is able to accurately recover those additional measurements, the information content of those additional measurements must be negligible. So why would I recover those? For applications calculating properties of the diffusion tensor this should hold true as well.

    The introduced improvements over previous algorithms are well motivated, but the neural architecture search leaves the impression of being orthogonal to the work on this method. The architecture search would be expected to improve any method based on neural networks, independent of the task. So I am unconvinced whether it adds substantially to the state of the art. If this is taken into account, this work leaves the impression of an incremental improvement over previous methods.

    The evaluation only considers variations of a neural-network-based method which previously won the MUDI 2019 challenge. It is hard to judge the merits of these methods in the absence of other methods.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Excellent. The authors even went to the lengths of making their code available to reviewers in a way respecting anonymity. Data is also available and the description of the method is detailed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    This work appears to be an excellent engineering paper, which sets some best-practice standards. My biggest concern is with the importance of the task it sets out to solve. This concern may be addressed by explaining it more clearly in a rebuttal and also adding explanatory sections to the paper. In addition, the evaluation leaves a very one-sided impression. All the comparison methods are variants of SARDU-Net. It would be more convincing if other methods were used. E.g. one simple comparison could be a simple linear model where each volume is predicted by a linear combination of the set of subsampled volumes; the set of volumes for the subsampling could be determined e.g. by random search (a sketch of such a baseline follows below). Alternatively, I expect a look towards compression methods and dictionary-learning methods should yield strong alternative baselines.
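
A minimal sketch of the linear baseline suggested above (hypothetical; it fits and scores on the same data for brevity, whereas a fair comparison would use the challenge's train/test split):

```python
# Linear-combination baseline with a random search over subsets.
import numpy as np

def linear_baseline(signals, M, n_trials=100, seed=0):
    """signals: (n_voxels, N) fully sampled data; returns (best subset, its MSE)."""
    rng = np.random.default_rng(seed)
    n_voxels, N = signals.shape
    best_idx, best_mse = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(N, size=M, replace=False)
        X = signals[:, idx]                              # subsampled volumes
        W, *_ = np.linalg.lstsq(X, signals, rcond=None)  # least-squares reconstruction weights
        mse = float(np.mean((X @ W - signals) ** 2))
        if mse < best_mse:
            best_idx, best_mse = idx, mse
    return best_idx, best_mse
```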

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I doubt the motivation of the task to be solved. If I could learn about the significance of the task, or see the method applied to a more obviously useful compression problem, my opinion could change. However, I also highly doubt the evaluation. Since the task is not well studied, I believe comparing only against versions of one related method leaves a high risk that many existing methods could perform similarly well.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    The authors correctly pointed out to me that by using the challenge evaluation protocol their evaluation is much stronger because it includes other baselines from the challenge. I still doubt the motivation of the challenge task itself but this opinion seems to be singular among my colleagues. Therefore, I raised my score to weak accept.



Review #3

  • Please describe the contribution of the paper

    The authors propose a method to recursively eliminate features in order to determine the limits of subsampling.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Replacing a weighting scheme with an elimination of features is a more robust implementation approach
    2. Comparison with the SARDU-Net and related new versions is commendable and shows improved stability and performance
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is important to discuss how this subsampling scheme might be used during real-world application (prospective deployment) towards an accelerated acquisition.
    2. Difference images in figure 2, as well as SSIM values, would enable a more comprehensive comparison (see the sketch after this list)
    3. A visual interpretation during the iterative elimination of the features would be useful to enable explainable AI
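
A minimal sketch of the difference-image and SSIM comparison suggested in point 2 (assumes scikit-image is available; slice handling and intensity scaling are illustrative):

```python
# Difference image and SSIM between a reconstructed and a reference slice.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def compare_slices(recon, reference):
    """recon, reference: 2D slices on the same intensity scale."""
    diff = recon - reference                               # signed difference image
    data_range = float(reference.max() - reference.min())
    score = ssim(recon, reference, data_range=data_range)
    return diff, score
```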
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The reproducibility criteria have been met.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors have used a meaningful approach to solve the dual problem over a standard dataset, especially improving the performance of a previous winning submission and related versions. The authors are requested to look at the strengths and weaknesses section for further comments.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have demonstrated improvement over an award-winning submission with a different approach that is easy to implement, although recursive. The quantification of the results can be improved as noted in the comments.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    Reviewers thought that the idea of using recursive feature elimination for progressive subsampling is plausible, and appreciated the fact that, when combined with hyperparameter optimization, it led to a clear improvement compared to the winning entry of the MUDI 2019 challenge. On the other hand, reviewers thought that the practical relevance of the task in that challenge, and of the proposed evaluation metrics, should be motivated more clearly. They also thought that the observed improvement might be primarily a trivial consequence of increased network capacity, they were concerned that MSE quantification might include the known inputs, and they thought that the term “neural architecture search” was improperly used. I think authors should be given the opportunity to comment on these points in a rebuttal.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2




Author Feedback

We appreciate the constructive feedback. We are pleased with R3’s highest score and Meta Reviewer (MR) ranking our paper second. We address MR’s points:

MR: practical relevance of the task in that challenge should be motivated more clearly
R2.3: relevant … for clinical … value … access to all those parameter variations?
R2.6, R2.8: doubt the motivation … If I could learn about the significance of the task … my opinion could change

Our task and data, and thus our motivation, are identical to the MUDI challenge [1] and its >2-page description in paper [26]. We will also add: we want to obtain economical, but maximally informative, acquisition protocols for any model that the full data set supports. Furthermore, unlike classical approaches, we approach the experiment-design problem in a new model-independent way, seeking the subset that best supports re-estimation of the full data set.

MR: evaluation metrics … motivated more
R1.3: MSE can be misleading … quantitative downstream analysis … e.g. FA, NODDI, fODF
MR: MSE quantification might include the known inputs
R1.3: (table 1) … 1344 DW are predicted … performance numbers might be biased [difference M=500 vs M=100] … overfit to … input … comparison in Table 1 is invalid

For fairness, we used the same evaluation as the MICCAI MUDI challenge [26] sec 2.2, 2.3: MSE on the entire reconstructed signal. We presented qualitative parameter maps from a random subject in figures 2, 4. We will add quantitative improvements (baseline MSE vs our MSE): NODDI ficvf 0.022 vs 0.007, NODDI fiso 0.023 vs 0.005, NODDI odi 0.020 vs 0.013, FA 0.006 vs 0.004. MR comment 2 is correct; we disagree with R1.3, as the same evaluation is used for all the networks in table 1, and without the input subset in the reconstruction we see similar, large improvements.

MR: improvement … consequence of increased network capacity
R1.3: obfuscated and ignored … performance increase by more layers/features (=more parameters= network capacity) is – in my eyes – trivial
R1.3: Normalize different architectures to one network size … expect the “largest” architecture … larger than the baselines to achieve the best performance … explain improvements for M <=50

This is incorrect. With more capacity, `larger' networks may overfit or optimize poorly, worsening performance. We see this empirically, as SARDU-Net used a smaller selector network for M < 500 compared to M=500 to improve results, verified from the public code [4].
Furthermore, in table 1, for the MUDI challenge values of M we achieve large improvements with PROSUB w/o NAS, whose network hyperparameters are fixed to the original SARDU-Net's and which has the same number of parameters, so our three non-NAS contributions improve results. We were scrupulous in using SARDU-Net on the task it was designed for and won; we expect its hyperparameters and network structure to have already been tuned. Examining the architecture choices for small M, we hypothesize that NAS chooses a better structure to avoid overfitting, rather than a greater number of parameters.
We will clarify the paper.

MR: “neural architecture search” [NAS] was improperly used
R1.3: NAS … incorrect terminology
R1.6: issue can be easily fixed by replacing … (NAS) by … AutoML

We only optimize the architecture hyperparameters (no. units, no. layers); we will take R1's advice.

R1.6 writing is often unclear/hard to follow/imprecise

Both other reviewers disagree (clarity and organization: R2 Excellent, R3 Very Good); nevertheless, we will address the specific issues R1 raises.

R2.6 R2.8 [use other baselines] E.g. … simple linear model where each volume is predicted by a linear combination … set of volumes for the subsampling … random search

We outperform the MICCAI MUDI challenge winner, and thus all other entrants in [26], including approaches suggested by R2. It is unclear which baselines to use; R2 does not suggest anything specific, and we expect challenge entrants to have chosen the best from a wide variety of methods.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I agree with the authors that a well-defined task from a recent challenge is a suitable basis for evaluation. The rebuttal has also led two of the reviewers to raise their scores. There is some remaining debate about the impact of network capacity, which is not as easily dismissed by referring to overfitting as the authors appear to assume – I would encourage them to look into the double descent phenomenon. However, I think the paper is acceptable for MICCAI in its current form, and this discussion could be continued in future works.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The main concern in the reviews is the motivation of the underlying challenge. While this might be debatable for many papers, the submission seems to be methodologically sound and performs well. The authors clarified certain points in their rebuttal and R2 raised their score after discussion.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Weaknesses were nicely addressed in the rebuttal. This work looks to me like a valuable contribution to MICCAI.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    upper


