Paper Info Reviews Meta-review Author Feedback Post-Rebuttal Meta-reviews

Authors

Cheng Zhao, Richard Droste, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

Abstract

Ultrasound (US) probe motion estimation is a fundamental problem in automated standard plane locating during obstetric US diagnosis. Most recent works employ a deep neural network (DNN) to regress the probe motion. However, these deep regression-based methods leverage the DNN to overfit on the specific training data, which naturally lacks generalization ability for clinical application. In this paper, we return to generalized US feature learning rather than deep parameter regression. We propose a self-supervised learned local detector and descriptor, named USPoint, for US-probe motion estimation during the fine-adjustment phase of fetal plane acquisition. Specifically, a hybrid neural architecture is designed to simultaneously extract a local feature and further estimate the probe motion. By embedding a differentiable USPoint-based motion estimation inside the proposed network architecture, USPoint learns the keypoint detector, scores and descriptors from motion error alone, which does not require expensive human annotation of local features.
The two tasks, local feature learning and motion estimation, are jointly learned in a unified framework to enable collaborative learning with the aim of mutual benefit. To the best of our knowledge, this is the first learned local detector and descriptor tailored to US images. Experimental evaluation on real clinical data demonstrates performance improvements in feature matching and motion estimation, indicating potential clinical value.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_11

SharedIt: https://rdcu.be/cVRUR

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    This manuscript presents an end-to-end learning method to obtain the relative 3D pose between source and target ultrasound images. The proposed method first extracts interest points with descriptors from the two images, finds the best matches with a graph neural network, and applies singular value decomposition (SVD) to obtain the final relative 3D pose between the coordinate systems where the IMU sensor was installed during data collection. The paper conducts experiments demonstrating that the method achieves better feature-matching performance than a hand-crafted and a learning-based feature matching method. The authors also compare the proposed method with a regression-based pose estimation method for motion estimation and show better performance.
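    For context, the closed-form SVD registration step this summary refers to is commonly implemented with the Kabsch/Umeyama algorithm. The sketch below is illustrative only (function name, point shapes and the sanity check are assumptions, not the paper's code):

```python
import numpy as np

def rigid_pose_svd(src, dst):
    """Estimate rotation R and translation t such that dst ~ R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Fix a possible reflection so that det(R) = +1 (a proper rotation).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Sanity check: recover a known rotation about z and a known translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
src = np.random.default_rng(0).normal(size=(50, 3))
dst = src @ R_true.T + t_true
R, t = rigid_pose_svd(src, dst)
assert np.allclose(R, R_true, atol=1e-8)
assert np.allclose(t, t_true, atol=1e-8)
```

    Because the SVD is differentiable almost everywhere, this step can sit inside an end-to-end pipeline, which is what makes the design described above trainable from motion error alone.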

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper organization and writing are clear. The figures are also well made and helpful. I can well understand the proposed method.
    2. The end-to-end pipeline involving several stages is interesting and intuitively should be more generalizable to unseen cases compared with a regression-based pose estimation method. Additionally, the proposed pipeline is explainable, and developers could look into the different stages of the pipeline to figure out the source of errors if degraded performance were to be observed in practice.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The proposed method might not be usable for actual fetal standard-plane finding. For this method to work, an image of the standard plane of the scanned subject needs to be acquired in advance. The method is simply a pose estimation method between two ultrasound images, and the models have no conception in their learned weights of what a standard-plane image should look like.
    2. There seem to be some fundamental issues with the design of the method. In the motion estimation module, as I understand it, the authors want to first lift the matched 2D pixel pairs into 3D Euclidean space so that a closed-form point cloud registration method based on singular value decomposition can be used to obtain the relative 3D pose. However, the authors seem to use a learning-based network, “Transform Net”, to convert 2D pixel locations to 3D Euclidean locations. Wouldn’t there be a non-learnable way to directly obtain the 3D locations of the pixels with respect to the imaging source, based on the imaging mechanism of ultrasound? In addition, what the authors present here estimates the motion of the point where the IMU was installed during data collection, rather than that of the actual ultrasound probe. Should there be a calibration process to bridge this gap? Both problems above may suggest that, with the current design, the learned models may not generalize across different types of ultrasound probes.
    3. The novelty is relatively limited. As the authors point out, the design of the local detector and descriptor and the pre-training method follow the design in citation 3. The graph matching method used to find feature matches is based on the work named SuperGlue. There are also existing methods discussing how to handle singular value decomposition differentiably. The main technical novelty seems to be combining these modules into an end-to-end pipeline for the task of relative pose estimation between two ultrasound images.
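    To make the non-learnable alternative in point 2 concrete: for a linear-array probe, a pixel can be lifted into the probe frame directly from known pixel spacings and a fixed image-to-probe calibration transform. Everything below (the spacings, the calibration matrix, the function name) is an illustrative assumption, not the paper's method:

```python
import numpy as np

sx, sy = 0.2, 0.2                        # assumed pixel spacing, mm/pixel
T_probe_image = np.eye(4)                # assumed image-to-probe calibration
T_probe_image[:3, 3] = [5.0, 0.0, 10.0]  # e.g. offset of image origin, mm

def lift_pixel(u, v):
    """Map a pixel (u, v) to a 3D point in the probe frame (z = 0 in-plane)."""
    p_image = np.array([u * sx, v * sy, 0.0, 1.0])  # homogeneous in-plane point
    return (T_probe_image @ p_image)[:3]

p = lift_pixel(100, 50)
assert np.allclose(p, [25.0, 10.0, 10.0])
```

    Such a fixed lift would require an explicit probe calibration, which is exactly the trade-off the reviewer raises against learning the transform from data.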
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The dataset will not be publicly available, but the source code is promised to be made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. In equations (3) and (4), shouldn’t there be no subscript “j” for the symbol “M”? “j” is only used inside the summation.
    2. Consider using the special symbol instead of the italic letter “T” for matrix transpose for all your equations.
    3. More information is needed on the motion estimation module. How do you use the Transform Net? Does it take all 2D point pairs and output a single 3x3 matrix “T”? I find the process of predicting a 3x3 matrix to lift 2D pixels and then using a closed-form 3D point cloud registration solution based on SVD problematic. Can the authors explain more about why they chose this process for motion estimation?
    4. End of page 7, “arccos” and “GT” should not be italic. These are texts instead of symbols.
    5. It is important to know the performance of the local detector and descriptor coming out of the first two pre-training phases compared to the one out of the final phase of the proposed end-to-end pipeline. This is because the first two phases, mentioned in the supplementary material, come from other works and the authors need to show that their proposed end-to-end pipeline used in the final phase of the training process further boosts the feature matching performance.
    6. The deep regression method the authors compare against seems to have been published in 2018. Are there no more recent ones with better performance that the authors could compare their method against?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factor is the seemingly limited practical value of the proposed method in the application of standard-plane finding. Also, the proposed method seems to have some fundamental design issues that should be addressed by the authors. The novelty is also somewhat limited, considering that all of the individual modules exist in previous works.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors addressed some of my main concerns. Please revise the introduction section to include the discussion of the method motivation from the rebuttal, as the method alone can only estimate the relative probe motion and cannot complete fetal standard-plane finding if no standard plane is provided by an experienced sonographer. In addition, the authors need to change their “self-supervised” claim to “weakly-supervised”, because self-supervised learning means all ground-truth data come from the input data that the network needs at inference time. That is not the case in this paper, as the IMU is not needed at inference time. Therefore, “weakly-supervised” should be used instead, and I hope the AC can help ensure this modification to avoid confusion among readers.



Review #3

  • Please describe the contribution of the paper

    In this work, the authors propose a method that estimates the relative probe motion from a pair of images, using self-supervised keypoint detection and an attentional graph neural network to match keypoints between the source and target image pair. They apply this to fetal ultrasound (US) images with the intention of guiding non-expert users to successfully acquire the desired US view-plane. Once probe motion is reliably estimated, relative to an ideally established probe position for that view-plane, instructions can potentially be given to remedy the deviation from the ideal position. In terms of results, they demonstrate that both their features and their method are superior to the other methods they compare with.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method proposed by the authors seems fairly new for US imaging. Direct regression methods (of motion parameters) perhaps don’t work well for the subtle motions described here. Essentially, a direct regression method tries to optimize equally for all points in the image (including many noisy, non-informative points), whereas here the model learns to optimize only for a select set of key interest points. So this sort of approach provides an alternative in these situations.
    • The method doesn’t require annotated keypoints/segmentations for interesting point generation, which is great.
    • Visualizing the keypoints and their matches is a great way to interpret whether or not the method is working - so this is bonus explainability.
    • The data used for training/testing is quite thorough.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main criticism is that this paper has too much content for a short conference paper - this leads to some parts being not very clearly explained.
    • Although the keypoint detection part doesn’t need any ground truth, you do ultimately use ground-truth IMU or simulation-based data to indirectly train it, right? Could you comment on how you would make it fully self-supervised? If you applied the predicted transformation to the source image, computed a loss against the target image, and backpropagated that, would it work?
    • It is clear that you’re solving for subtle linear probe motion. What happens if there is a large motion and/or other compounding non-linear motions such as breathing, cardiac motion, etc.?
    • Won’t SVD be an expensive operation to carry out each time? How long does training your algorithm take on the hardware you describe (which seems quite good)?
    • How reliable are IMU signals? They are subject to some noise/drift issues too, right?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • I didn’t find reference to code/data etc. I suspect the code may be easily available later but not sure about the data since it seems more custom generated.
    • The components of the method should be generally easy to replicate and test.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • Page 5, attentional graph matching section -> My sense is that ‘intra’ and ‘inter’ are standard terms as opposed to ‘intra’ and ‘extra’ ? Not a big issue though.
    • About the Sinkhorn algorithm: how is it executed during training? Is it a differentiable operation like the SVD? Or is it just a callback or something? Could you explain how this works?
    • You write U, V \in SO(3). SO(3) isn’t described; it may be beneficial to some readers to define it.
    • The quaternion is also not defined.
    • Your method also seems to suffer from the standard drift issue (larger error for frames that are farther away from the reference frame) right? Do you have any comments/ideas on how to improve that?

    • There’s some language issues here and there affecting clarity. Some obvious ones I caught:
    • page 1, motivation: “locate the probe position” -> However, locating the probe position.
    • page 2, “the main limitation is naturally for lack of generalization ability” -> the main limitation is naturally the lack of generalizability.
    • page 3: “as Fig. 1 shown” -> As Fig 1. shows
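    Regarding the Sinkhorn question above: in SuperGlue-style matching, Sinkhorn is typically executed as a fixed number of alternating row and column normalizations of a score matrix, each a smooth operation, so gradients flow through it like any other layer. A minimal log-domain NumPy sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def sinkhorn(scores, n_iters=50):
    """Normalize a score matrix toward a doubly-stochastic matrix."""
    log_p = scores.copy()
    for _ in range(n_iters):
        # Row normalization in log space (softmax-style, numerically stable).
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        # Column normalization.
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)

P = sinkhorn(np.random.default_rng(1).normal(size=(4, 4)))
assert np.allclose(P.sum(axis=0), 1.0, atol=1e-6)  # columns sum to 1
assert np.allclose(P.sum(axis=1), 1.0, atol=1e-4)  # rows converge to 1
```

    Because every step is a differentiable normalization, no callback is needed: the whole loop is unrolled inside the computation graph during training.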
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the work is quite good although the organization is a bit haphazard.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #4

  • Please describe the contribution of the paper

    A feature-based network for motion estimation of US images is proposed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Very good novelty in using a graph network and SVD alongside the learned features. Strong foundation and compelling results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The term self-supervised is misleading for this work; I think the method is unsupervised, not self-supervised. Unsupervised is used when the network is trained by comparing the first and second images, whereas the term self-supervised in motion estimation is used when the network corrects itself by comparing with its own prediction in a teacher-student fashion.

    The loss function is poorly presented: the authors mention ground truth in the loss function equation, while, as they state, the training does not require any annotations.

    The authors violated the rule that there should not be any text in the supplementary materials.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The details about training and hyper-parameters are not provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    The title is very long; please use a shorter title. Page 2, typo: “methed” should be “method”. I think there is a misunderstanding of self-supervised training. The authors should cite recent unsupervised motion estimation work in US imaging, including: Tehrani AK, Sharifzadeh M, Boctor E, Rivaz H. Bi-Directional Semi-Supervised Training of Convolutional Neural Networks for Ultrasound Elastography Displacement Estimation. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control. 2022 Jan 27.

    Delaunay R, Hu Y, Vercauteren T. An unsupervised learning approach to ultrasound strain elastography with spatio-temporal consistency. Physics in Medicine & Biology. 2021 Sep 3;66(17):175031.

    and semi-supervised methods: KZ Tehrani A, Mirzaei M, Rivaz H. Semi-supervised training of optical flow convolutional neural networks in ultrasound elastography. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2020 Oct 4 (pp. 504-513). Springer, Cham.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is a good contribution in the paper, but a few modifications should be made, including to the loss function and the self-supervised term.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    6

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    Although the authors did not fully address my concerns, I changed my decision to accept because of the novelty presented in the work and the explanations given in the rebuttal.




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    All three reviewers agree that this is generally a good contribution. R1 seems a little concerned about the practical value of the proposed method. A detailed rebuttal to address the reviewers’ comments would be recommended.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    10




Author Feedback

We thank the reviewers and meta-reviewer for their valuable comments.

Reviewer 2 Practical value: 1) During the training of a trainee sonographer, a high-quality standard plane captured by an expert sonographer is stored in advance. The predicted motion guides the trainee to capture the saved standard plane more efficiently, improving their practical scanning ability. 2) Kindly note that we focus on one of the key problems rather than a whole system for standard plane acquisition. The learned generalized US feature, USPoint, can be inserted as a feature-extraction frontend to a time-series model, e.g. an LSTM or a Transformer, for a series of motion predictions. USPoint can extract and match the local features between two adjacent US images to feed the time-series model. The predicted motion could guide a non-expert to choose the optimal action to approach the standard plane. This solution does not require a pre-captured high-quality standard plane. 3) The feature-based motion estimation can also be used for 3D US reconstruction from a sequence of 2D US scans in clinical applications.

Technical novelty and concern: Our main novelty is to propose the first learned local detector and descriptor, USPoint, tailored to US images. Although we borrow some SOTA sub-network architectures from the CV literature, the whole hybrid network is original and elaborately designed for US-probe data. USPoint is learned in a self-supervised manner from real clinical data without expensive human annotation. It significantly outperforms both SOTA geometric and learned local features from the CV literature. Moreover, this learned US local feature improves motion estimation compared with a CNN-encoder-based deep regression method that uses the whole US image equally, including many noisy, non-informative points. Our motion estimation generalizes better because we return to generalized feature learning rather than black-box deep parameter regression. Yes, the transformation matrix T is learned by Transform Net from the matched interest point positions. In conventional methods, this transformation matrix can be obtained directly by an elaborate US-probe calibration. We use a small sub-network to learn this matrix from the data, and we find it achieves satisfactory performance in the experiments. Of course, it can be replaced by a non-learnable approach. The generalization ability of the motion estimation may decrease slightly due to this small learned component, but the learned USPoint itself is a generalized US local feature. Compared to a conventional dense CNN encoder feature, it filters out non-informative points and selects key interest points, providing more geometric position and description cues. It could be used in other US-based tasks such as US image matching, retrieval and classification.

Reviewer 3 No major concerns. We will improve the paper structure and provide more technical details in the final paper. It is a very constructive suggestion to find a fully unsupervised way to learn the local feature, even without IMU signals, for future work. We think this suggestion is feasible. Alternatively, we can 1) generate a target US image by applying an assigned transformation to the source US image, and 2) train the network on the source-generated target US image pair, supervised by the assigned transformation.

Reviewer 4 Thanks for suggesting the related papers; we will cite them in the final paper. Loss function and self-supervision: This paper proposes a learned local detector and descriptor tailored for US images. We design a hybrid network to build a relationship between US local feature learning and probe motion estimation. During training, expensive human annotation of local features is not required. The network can be trained under the indirect supervision of motion signals captured from an IMU sensor. The GT in the loss function equation refers to the IMU signal. That is why we call it a self-supervised learned local feature.




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The reviewers are generally in agreement with the strong novelty in the paper, which makes this a good candidate for MICCAI. Please make sure you address the comments of Reviewer 2 and change the “self-supervised learning” to “weakly-supervised learning”.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has convincingly addressed the major criticism and R4 has upgraded their review score. The paper can be accepted.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have adequately addressed comments about practical value of the work and clarified issues regarding the novelty. They have also noted that the final version of the paper will be updated using the reviewers’ comments. Therefore, I recommend accepting this paper for MICCAI.

    Minor points: Please carefully proofread the paper before publication. There are several typos such as: US abbreviation is only defined in the abstract. It should also be defined in the main text. Page 3: “As Fig. 1 shown” is grammatically incorrect and should be changed to As Fig. 1 shows. Same for “As Fig. 2 shown” etc.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    2


