
Authors

Meng Zheng, Benjamin Planche, Xuan Gong, Fan Yang, Terrence Chen, Ziyan Wu

Abstract

3D patient body modeling is critical to the success of automated patient positioning for smart medical scanning and operating rooms. Existing CNN-based end-to-end patient modeling solutions typically require a) customized network designs demanding large amounts of relevant training data covering extensive realistic clinical scenarios (e.g., patient covered by sheets), which leads to suboptimal generalizability in practical deployment, and b) expensive 3D human model annotations, i.e., requiring a huge amount of manual effort, resulting in systems that scale poorly. To address these issues, we propose a generic modularized 3D patient modeling method consisting of (a) a multi-modal keypoint detection module with attentive fusion for 2D patient joint localization, to learn complementary cross-modality patient body information, leading to improved keypoint localization robustness and generalizability in a wide variety of imaging modalities (e.g., CT, MRI, etc.) and clinical scenarios (e.g., heavy occlusions); and (b) a self-supervised 3D mesh regression module which does not require expensive 3D mesh parameter annotations to train, bringing immediate cost benefits for clinical deployment. We demonstrate the efficacy of the proposed method with extensive patient positioning experiments on both public and clinical data. Our evaluation results achieve superior patient positioning performance across various imaging modalities in real clinical scenarios.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_12

SharedIt: https://rdcu.be/cVRUS

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a method for 3D patient body modeling. To this end, the authors propose a framework to localize 2D keypoints with two branches for RGB and depth images and to estimate a 3D mesh from the 2D keypoints. The proposed system using RGBD data shows a mean per-joint position error (MPJPE) of 115 mm for 3D mesh regression, which is 22 mm lower than that of RDF [1].

    [1] Yang, F., Li, R., Georgakis, G., Karanam, S., Chen, T., Ling, H., Wu, Z.: Robust multi-modal 3d patient body modeling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (2020)
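
    For context, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, usually after aligning the root joints. A minimal NumPy sketch of the metric (illustrative only; not the authors' evaluation code):

```python
import numpy as np

def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint position error, in the units of the inputs (here mm).

    pred, gt: (J, 3) arrays of predicted / ground-truth 3D joint positions.
    Both skeletons are translated so their root joints coincide, then the
    per-joint Euclidean distances are averaged.
    """
    pred = pred - pred[root_idx]  # root-align the prediction
    gt = gt - gt[root_idx]        # root-align the ground truth
    return np.linalg.norm(pred - gt, axis=1).mean()

# Sanity check: a perfect prediction yields zero error.
joints = np.random.rand(24, 3) * 1000  # 24 joints, coordinates in mm
assert mpjpe(joints, joints) == 0.0
```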

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The network combining RGB and depth image features/heatmaps with intra-modal and inter-modal attention is novel. 2) The application shown in Section 3.3 is appropriate and demonstrates the usefulness of the proposed framework.
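
    The paper's exact fusion architecture is not reproduced in this review; as a rough illustration of what combining intra-modal and inter-modal attention over RGB and depth heatmaps could look like, here is a hypothetical PyTorch sketch (module structure, gate design, and shapes are all assumptions, not the authors' design):

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Hypothetical attentive fusion of RGB and depth keypoint heatmaps.

    Each branch first re-weights its own J heatmap channels (intra-modal
    attention); the two branches then gate each other spatially
    (inter-modal attention) before the heatmaps are fused.
    """
    def __init__(self, num_joints: int):
        super().__init__()
        # Intra-modal: per-channel gates from globally pooled heatmaps.
        self.intra_rgb = nn.Sequential(nn.Linear(num_joints, num_joints), nn.Sigmoid())
        self.intra_d = nn.Sequential(nn.Linear(num_joints, num_joints), nn.Sigmoid())
        # Inter-modal: 1x1 convolutions producing a spatial gate for the other modality.
        self.inter_rgb = nn.Sequential(nn.Conv2d(num_joints, 1, 1), nn.Sigmoid())
        self.inter_d = nn.Sequential(nn.Conv2d(num_joints, 1, 1), nn.Sigmoid())

    def forward(self, h_rgb, h_d):
        # h_rgb, h_d: (B, J, H, W) heatmaps from the RGB / depth branches.
        g_rgb = self.intra_rgb(h_rgb.mean(dim=(2, 3)))[:, :, None, None]
        g_d = self.intra_d(h_d.mean(dim=(2, 3)))[:, :, None, None]
        h_rgb, h_d = h_rgb * g_rgb, h_d * g_d  # intra-modal re-weighting
        # Each modality is spatially gated by attention derived from the other.
        fused = h_rgb * self.inter_d(h_d) + h_d * self.inter_rgb(h_rgb)
        return fused / 2  # fused (B, J, H, W) heatmaps

fusion = AttentiveFusion(num_joints=17)
out = fusion(torch.rand(2, 17, 64, 48), torch.rand(2, 17, 64, 48))
print(out.shape)  # torch.Size([2, 17, 64, 48])
```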

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The description of the 3D mesh estimation is difficult to understand, its reproducibility is poor, and there is no particular novelty in that part. 2) The credibility of the results is limited because cross-validation is not performed on data with a small number of patients.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The RGBD keypoint detection module does not seem challenging to implement, but its parameters are not fully explained. The 3D mesh estimation is challenging to implement.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is not easy to see that the comparison with OpenPose is meaningful. It is recommended to compare with more recent pose estimation approaches or to train state-of-the-art approaches on the given data. While training the RGBD keypoint detection framework, the magnitude of the difference between RGB-based and depth-based errors is ignored, which leaves room for improvement. The training data was composed of only a small number of patients (3), so it would be good to show how performance changes with the amount of training data.

    Several typos need to be corrected: Page 3, "Given a RGB" -> "Given an RGB"; Page 5, "(i.e., AMASS[25], …" -> missing the closing parenthesis; Page 6, Sec. 3.1(1), inconsistent double quotation marks.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Most of the manuscript except for 3D mesh estimation is clear and easy to follow, and the results are plausible.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The manuscript presents an automatic approach for estimating the 3D mesh of a patient from a given RGB and depth image pair. It proposes attentive fusion to combine the RGB and depth heatmaps into fused heatmaps and 2D keypoints, which are then passed to a regressor that estimates the SMPL parameters of the body mesh.
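
    As a rough illustration of one way 2D keypoint coordinates can be read out of fused heatmaps before being passed to such a regressor, here is a standard soft-argmax readout in PyTorch (whether the paper uses exactly this readout is an assumption):

```python
import torch

def soft_argmax_2d(heatmaps: torch.Tensor) -> torch.Tensor:
    """Differentiable (x, y) keypoint readout from (B, J, H, W) heatmaps."""
    b, j, h, w = heatmaps.shape
    probs = heatmaps.reshape(b, j, -1).softmax(dim=-1).reshape(b, j, h, w)
    xs = torch.arange(w, dtype=probs.dtype)
    ys = torch.arange(h, dtype=probs.dtype)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # expected column index
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # expected row index
    return torch.stack([x, y], dim=-1)       # (B, J, 2) pixel coordinates

keypoints = soft_argmax_2d(torch.rand(2, 17, 64, 48))
print(keypoints.shape)  # torch.Size([2, 17, 2])
```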

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript addresses the medically relevant problem of estimating a 3D mesh from multi-modality RGB-D images. The manuscript is well written and presented. The proposed attentive fusion approach effectively fuses multi-modality heatmaps, helping achieve the final 3D mesh reconstruction.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A fair comparison with the current state-of-the-art is missing.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    • Datasets and code: The authors used a combination of public and private datasets. They have neither provided nor mentioned the availability of the private dataset, models, or training/evaluation code upon acceptance.

    • Experimental results: No results are reported for different hyperparameter settings or for the sensitivity of the results to the hyperparameters; the authors used fixed hyperparameters.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The authors should consider addressing the following points.

    1. The proposed methodology of using attentive fusion is novel and appropriately combines the RGB and depth heatmaps for multi-modality 2D human pose estimation. The authors obtain better results than the state-of-the-art RDF model. However, the RDF model uses ResNet-50 backbone features from the various modalities, whereas the authors have used the more accurate, higher-capacity HRNet model. Whether the improvement comes from the better backbone or from the proposed attentive fusion is therefore currently unclear. For a fair comparison with the RDF model, the authors should consider either using the same backbone or plugging their proposed fusion method into the RDF model.
    2. The authors use the "Depth Keypoint Detection branch" to estimate 2D keypoints from the depth image. However, a depth image provides a richer 3D representation encoded in the depth values. Since the final aim is to estimate the full 3D mesh, using only spatial 2D heatmaps from the depth image might not exploit its full potential. The authors should consider extracting more appropriate 3D features from the depth image for the 3D mesh prediction (e.g., by back-projection to a point cloud, as sketched below).
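
    To make point 2 concrete: a standard way to expose the 3D structure in a depth image is to back-project it into a point cloud with the camera intrinsics. A minimal NumPy sketch (the intrinsics below are placeholders):

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map (in meters) to an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx  # pinhole camera model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

cloud = depth_to_pointcloud(np.full((480, 640), 2.0),
                            fx=580.0, fy=580.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```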
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is very well written and addresses the challenging problem of 3D mesh estimation from multi-modality images. The only concern with the proposed fusion approach is the fairness of the comparison against the state-of-the-art RDF model. The authors should consider addressing the issues described above.

  • Number of papers in your stack

    6

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper describes a CNN-based approach for 3D patient modelling from RGB-D acquisitions under challenging conditions, i.e. different clinical scenarios. The main contribution is a more efficient supervision of the CNN, achieved by splitting the 3D model generation into two steps, which can be supervised individually: First, joint keypoints are detected from RGB, D, or RGB-D inputs using an existing 2D keypoint detector and a fusion module, and the authors show how this can be trained using unsupervised pretraining and a relatively small amount of labelled training data. For 3D mesh regression, they also use an existing architecture, and describe an approach for generating synthetic 2D joint and mesh parameter pairs for training the mesh regressor in a self-supervised fashion. They evaluate their approach by comparing against other 2D joint and 3D mesh regression methods, showing good generalizability of their method.
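
    The paper's synthesis procedure itself is not reproduced in this review; as a generic illustration of how (2D joints, mesh parameters) supervision pairs can be generated from a parametric body model, here is a hypothetical sketch using the smplx package (sampling ranges, camera parameters, and the model path are assumptions, and the SMPL model files must be obtained separately):

```python
import torch
import smplx  # pip install smplx; SMPL model files are downloaded separately

def make_training_pair(model, f=500.0, cx=128.0, cy=128.0):
    """Sample random SMPL parameters and project the 3D joints to 2D.

    Returns (joints_2d, (betas, body_pose)): one synthetic supervision pair
    for training a 2D-keypoints-to-mesh regressor without manual 3D labels.
    """
    betas = torch.randn(1, 10) * 0.5      # random body shape
    body_pose = torch.randn(1, 69) * 0.2  # mildly perturbed body pose
    out = model(betas=betas, body_pose=body_pose,
                global_orient=torch.zeros(1, 3))
    # Place the body in front of a simple pinhole camera and project.
    joints = out.joints[0] + torch.tensor([0.0, 0.0, 3.0])
    joints_2d = f * joints[:, :2] / joints[:, 2:3] + torch.tensor([cx, cy])
    return joints_2d, (betas, body_pose)

model = smplx.create('models/', model_type='smpl', gender='neutral')
joints_2d, params = make_training_pair(model)
```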

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The training strategy proposed in the paper is very clever. To the best of my knowledge, the split supervision of 2D keypoint detection and 3D model generation, the latter of which can be supervised with synthetic training data only, is a novel concept. Aside from the proposed clinical setting, where it is used to overcome challenges associated with variability in patient positioning for different imaging modalities and coverage of the patient with sheets, the same strategy could also be applied to other non-medical but non-generic scenarios in which current state-of-the-art methods perform poorly.

    • The paper is well motivated and timely. Scanning automation is an increasingly important topic, especially considering the current situation, where clinical staff is typically in short supply and it can be advantageous to avoid too much close contact between staff and patients. Although many pose and human shape estimation methods exist, generalizing them to challenging clinical conditions seems to be an open problem.

    • The paper is well written and pleasant to read. I could not spot significant grammatical errors, spelling errors, or typos.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • I think the experiments part of this paper could be stronger. The entire section is not as well written and structured as the remainder of the paper, and some choices and methods are not entirely clear from the description (see detailed points).

    • The clinical significance and impact of the achieved results are not discussed. It remains unclear whether the method satisfies clinical demands, and if not, which limitations remain.

    • Although the amount of required labelled training data can be significantly reduced by the approach, a considerable number of labelled training pairs from clinical, in-bed pose images is still needed to train the network. Aside from a public dataset, the authors use a large collection of proprietary data. For most researchers aiming to use or build upon the presented approach, such a dataset will be difficult to obtain.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The method is overall reproducible. Although the description of the methods lacks many details in the main paper, they are provided in the supplementary material. The results in the paper could likely not be reproduced, since the authors use proprietary training and testing data. It would be very valuable for the field if this data were made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    Methods

    • The loss function is shown in Fig. 3; however, it is not explained anywhere in the main paper. Maybe its definition and explanation can be moved to the supplementary material entirely, as going back and forth between supplementary and main paper is tedious for the reader.

    Experiments and Results

    • It is not clear on which dataset the baseline methods were trained. Do they use SLP, SLP + proprietary data, or are off-the-shelf models used? None of the baseline methods except RDF use the SLP dataset in their original models, so the comparison in terms of methodology would not be entirely fair for off-the-shelf models.

    • Following up on the previous point, general-purpose pose estimation methods such as OpenPose need to generalize to a much greater variability of poses, compared to the only lying, in-bed poses in SLP and the proprietary dataset. Maybe other works focusing on in-bed poses [1,2] could be considered for a comparison.

    • The work in Ref. [36] in the paper was followed up by the same group with Ref. [15] in the paper. Why was [36] chosen as the comparison, particularly considering that [15] also uses RGB-D modalities and shows very encouraging results?

    • It seems like the PVE-T-SC metric is missing from Table 1.

    • In Table 4, why are only results for the MI dataset presented? What about CT or MRI? Why was the head pose omitted from the evaluation?

    Conclusion

    • An interpretation of the results in terms of clinical significance is missing. For someone unfamiliar with the topic, the presented metrics show superior performance of the approach compared to other methods, but their clinical interpretation is not clear. It would be good if the authors could comment on the clinical significance of the results. Does the approach already satisfy the demands of a clinical application? Which error ranges would be considered acceptable? While I like the clinical evaluation in Section 3.3, this is especially true for these metrics.

    • Following up on the previous point, if the results are not yet satisfactory for a clinical application, limitations and areas for future research should be discussed as well.

    [1] Yin, Y., Robinson, J.P., Fu, Y.: Multimodal in-bed pose and shape estimation under the blankets. arXiv preprint arXiv:2012.06735 (2020).
    [2] Clever, H.M., Erickson, Z., Kapusta, A., Turk, G., Liu, K., Kemp, C.C.: Bodies at rest: 3D human pose and shape estimation from a pressure image using synthetic data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6215-6224 (2020).
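
    For readers unfamiliar with the PVE-T-SC metric mentioned above: it is commonly read as the per-vertex Euclidean error of the estimated mesh in a neutral T-pose after scale correction. A minimal sketch under that reading, using a least-squares optimal scale (an assumption; published implementations may correct scale differently):

```python
import numpy as np

def pve_t_sc(pred_verts, gt_verts):
    """Per-vertex error in T-pose with scale correction (illustrative sketch).

    pred_verts, gt_verts: (V, 3) T-posed mesh vertices. The prediction is
    rescaled by the least-squares optimal scalar before the mean per-vertex
    Euclidean distance is computed.
    """
    scale = (pred_verts * gt_verts).sum() / (pred_verts ** 2).sum()
    return np.linalg.norm(scale * pred_verts - gt_verts, axis=1).mean()
```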

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a solid contribution to the field and the results are encouraging. The flaws in the evaluation and description thereof, and the lack of critical discussion of the results, in particular in terms of clinical significance, prevent me from giving the paper a higher score.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper has received three detailed reviews, all of which agree that the method carries some novelty, has a compelling clinical motivation, and is presented clearly. There are some concerns around the experimental design, especially the lack of strong baselines, and the overall clarity of this section; further, the implications of the current system performance in the context of the clinical application are not discussed. While overall the strengths seem to outweigh the shortcomings, I would suggest some minor editing of the manuscript to address the concerns raised during review.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2/17 (>10th percentile)




Author Feedback

We thank reviewers R1, R2, R3 for their valuable comments and present some clarifications to their questions.

SIGNIFICANCE One key value of the work is the self-supervised synthetic training of the mesh regressor, which greatly reduces annotation efforts. From [R3-Q8] and as shown in Tab.5: for different CT protocols, our pipeline automatically aligns the center of the target body part with the scanner isocenter with mean errors of 5.3/7.5/8.1mm for abdomen/thorax/head resp. compared to GT isocenters (the median error of radiographers is 13.2mm [e]). Our automated positioning can improve scanning efficiency/quality (higher accuracy than manual positioning, less effort), possibly reduce patients' radiation intake, and enable contactless scanning to minimize staff's risks wrt infectious diseases. It can further improve/automate other scanning workflows, e.g., auto-estimating exposure parameters, scout-free scanning.

SOTA COMPARISON [R1] Though proposed in '16, OpenPose is still popular, continuously updated [a,b,3], and maintained. We posit that comparing to it is meaningful, and we further compared our detector to the recent HRNet [33] (Tabs.1-2). We will add more SOTA comparisons and results using the ResNet-50 backbone (same as RDF, whose code is not released), with/without our cross-modality fusion, to further prove its generalizability in the final version. [R3] SOTA mesh recovery comparison: as shown in Tab.1, HMR [14] and RDF [36] are trained on SLP [22]. Note that our proposed workflow requires only 2D labels (weak supervision, cf. Sec.3), while competing methods [14,36] require 2D + 3D mesh annotations (very expensive) for end-to-end training. We are able to achieve on-par/superior 3D mesh recovery with less supervision. In [15] Tab.V, the authors report 3D MPJPE of 137 and 107mm for RDF and RDF-OPT (with additional offline post-processing OPTimization) on SLP, while our method achieves 115mm without any post-processing. Our results with this post-processing [15] (likely improved) will be added to the final version. We thank R3 for pointing this out. [R3] Compared to other works [c,d] targeting lying poses, ours relies on less expensive modalities (RGB+D vs. pressure data in [c]) and is more practical/generic than [d] since: 1) we separate the 2D keypoint detection and 3D mesh regression processes, inferring meshes from robustly-detected 2D keypoints (leveraging large-scale public 2D datasets), thus enabling patient mesh reconstruction without expensive 3D mesh annotations (unlike [d]) and better generalizability in real clinical applications; 2) our RGBD keypoint detector is more flexible, not requiring paired RGB/D data for keypoint/mesh inference, while [d] requires both depth and IR images to properly infer. Lacking one input modality (common in clinical scenarios) largely affects performance in [d], whereas our system still produces satisfying results (Tabs.1-2).

REPRODUCIBILITY The cross-validation of keypoint detection and mesh recovery (Tab.2) on our proprietary CT dataset was performed on 13 subjects [R1-Q5.2]. All implementation details [R1-Q5.1] were provided in sup-mat. Hyperparameters were determined by cross-validation [R2-Q7].

MISC We will modify Fig.3, Tab.2.B, Tab.4 to make them clearer [R3-Q8]. For Tab.2.B, the authors of [36] provided neither the PVE-T-SC evaluation for RDF and HMR nor the corresponding code for reproducibility, so we omitted these comparisons. Tab.4 does not include CT/MRI results due to the space limit (they are in the sup-mat); we will add them back to the main paper.

REFERENCES
a) Cao et al., Realtime multi-person 2D pose estimation using part affinity fields, CVPR 2017
b) Simon et al., Hand keypoint detection in single images using multiview bootstrapping, CVPR 2017
c) Clever et al., Bodies at rest: 3D human pose and shape estimation from a pressure image using synthetic data, CVPR 2020
d) Yin et al., Multimodal in-bed pose and shape estimation under the blankets, arXiv 2020
e) Booij et al., Accuracy of automated patient positioning in CT using a 3D camera for body contour detection, Eur. Radiol., 2019
