
Authors

Szymon Płotka, Michal K. Grzeszczyk, Robert Brawura-Biskupski-Samaha, Paweł Gutaj, Michał Lipa, Tomasz Trzciński, Arkadiusz Sitek

Abstract

Predicting fetal weight at birth is an important aspect of perinatal care, particularly in the context of antenatal management, which includes the planned timing and mode of delivery. Accurate prediction of weight using prenatal ultrasound is challenging, as it requires images of specific fetal body parts during advanced pregnancy, which are difficult to capture due to poor image quality caused by the lack of amniotic fluid. As a consequence, predictions that rely on standard methods often suffer from significant errors. In this paper, we propose the Residual Transformer Module, which extends a 3D ResNet-based network for analysis of 2D+t spatio-temporal ultrasound video scans. Our end-to-end method, called BabyNet, automatically predicts fetal birth weight based on fetal ultrasound video scans. We evaluate BabyNet using a dedicated clinical set comprising 225 2D fetal ultrasound videos from 75 patients, acquired one day prior to delivery. Experimental results show that BabyNet outperforms several state-of-the-art methods and estimates the weight at birth with accuracy comparable to human experts. Furthermore, combining estimates provided by human experts with those computed by BabyNet yields the best results, outperforming either approach alone by a significant margin. The source code of BabyNet is available at https://github.com/SanoScience/BabyNet.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_34

SharedIt: https://rdcu.be/cVRvZ

Link to the code repository

https://github.com/SanoScience/BabyNet

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper
    • The authors propose an end-to-end method, called BabyNet, for birth weight estimation directly from fetal ultrasound video scans.
    • The authors design a novel Residual Transformer Module by adding temporal position encoding.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • An end-to-end birth weight estimation method based directly on fetal ultrasound video scans.
    • BabyNet is trained and validated on data acquired one day prior to delivery.
    • The experimental results are competitive with the state of the art (SOTA) on 225 2D fetal scans from 75 pregnant women.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The network is not innovative enough; the proposed network only adds temporal position encoding on top of BoT.
    • Some details about the dataset and experiments are omitted (see the detailed and constructive comments).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The code is publicly available, but the dataset is not.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The network is not innovative enough; the proposed network only adds temporal position encoding on top of BoT.
    • What is the sweep mode of the fetal videos used for training and validation? Is it a linear or a sector scan?
    • The mean number of frames per scan video is 852, while the temporal sequences input to BabyNet have only 16 frames. Would better performance be obtained if the temporal sequences were longer?
    • The authors describe in the Discussion section that BabyNet is combined with the clinicians' results by taking an average; this would be better described in the Experiments section.
    • In the penultimate paragraph on page 6, “Clinicians (this work) BabyNet” should be “Clinicians (this work) & BabyNet”.
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • An end-to-end birth weight estimation method is proposed, and it is trained and validated with data acquired one day prior to delivery.
    • The experimental results of the proposed method are superior to those of the SOTA methods, but the methodology lacks innovation.
  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a hybrid neural network called BabyNet, for automatically predicting fetal birth weight based on fetal ultrasound video scans.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    BabyNet efficiently bridges CNNs and transformers for end-to-end estimation of fetal weight from US videos. It avoids the high computational complexity of pure transformers while still allowing both local and global feature learning.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of BabyNet's data processing pipeline lacks detail.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided their source code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    1. Section 2.3: Images should be uniformly partitioned into patches (tokens) before they are input into transformer layers. What is the patch size in the proposed RTM architecture?
    2. Section 2: What is the loss function adopted in the proposed method?
    3. Section 3, Implementation Details: According to the authors, the acquired ultrasound images are 960×720 or 852×1136 pixels, while the input video frames are 64×64 (height×width). If I understand correctly, a lot of spatial information would be lost during resizing?
    4. Section 3, Implementation Details: The input sequence to BabyNet is 16 frames long, while a US video has about 852 frames. Do sequences from the same video overlap?
    5. Figure 1: It seems that one weight value is predicted per 16-frame sequence, while a US video has about 852 frames. Does one US video scan then correspond to multiple weight predictions?
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    BabyNet efficiently bridges CNNs and transformers for end-to-end estimation of fetal weight directly from US videos.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    The paper describes a method to estimate the birth weight of a fetus from ultrasound scans performed one day prior to delivery. The architecture (BabyNet) adds a Residual Transformer Module to a 3D ResNet-based network (a hybrid of CNN and transformer) to analyze ultrasound videos. It is evaluated using 225 fetal ultrasound videos from 75 patients and compared with state-of-the-art methods and human experts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an interesting problem: estimating the birth weight of a fetus from ultrasound videos.

    • Experimental results show that the proposed method outperforms state-of-the-art methods and is comparable to human experts. Combining human and AI estimates outperforms either alone.

    • The paper is well-written and organized.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The clinical significance of the paper is unclear, and further explanation of how it could be useful in clinical practice should be included. The authors state that “Accurate prediction of FBW is critical in determining the best method of delivery (natural or Cesarean)”; however, this statement needs appropriate citations and more detail, e.g.: Is this decision based solely on fetal birth weight, or are other factors also considered? Is this widely adopted in practice? When is the decision made (the paper evaluates videos only from 1 day before delivery)?

    • The paper does not discuss the interpretability of the proposed machine learning model. Quantitatively the results look promising compared to existing architectures; however, it is not clear what makes this architecture unique and suitable for the given problem. Do the authors analyze qualitatively how the model makes its decisions? Which visual features does the model learn in order to make accurate estimates? Which parts of the video received the highest and lowest attention in the MHSA layers?

    • The method was evaluated on a dataset of 225 videos. In general, vision transformers require large-scale datasets for training and suffer from overfitting on smaller datasets. The paper does not mention this aspect or how it was overcome (e.g., via regularization and augmentation).

    • Future directions of the work are missing in the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have provided source code at an anonymized GitHub repository. They state that the dataset and pre-trained models will be made available after acceptance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • The authors may provide more details and citations about the clinical significance of the proposed work. Is the decision between natural and Cesarean delivery based solely on fetal birth weight, or are other factors also considered? Is this widely adopted in practice? Is this decision usually made one day before birth, i.e., at the time these videos are acquired?

    • The model is trained using 225 videos. In general, vision transformers are trained on large-scale datasets and have been shown to suffer from overfitting on smaller datasets. Was any overfitting observed, given that the data seems to be small-scale? How was the model regularized in this case?

    • How was the network or its components initialized?

    • The length of the videos or the frame rate could be provided (currently only the total number of frames is provided).

    • What is the computational complexity of the different compared models? This could be additionally mentioned in Table 2.

    • It is not clear how clinicians obtain their estimates of the fetal weights. Do they use only the ultrasound video or additional patient metadata?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    5

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses an interesting problem: estimating fetal birth weight from ultrasound videos. A novel architecture combining CNNs and a vision transformer module is proposed. However, some points remain unclear, e.g., the interpretability of the proposed model, overfitting on smaller datasets, and the potential to use the method in clinical practice.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The authors propose a neural network for automatically predicting fetal birth weight from fetal ultrasound video scans. The paper addresses an interesting problem and is well-written and organized. Please take into account the comments of the reviewers when preparing your submission.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    NR




Author Feedback

We would like to thank all the reviewers for their insightful comments and constructive suggestions. Below, we address the main concerns raised amid the predominantly positive feedback.

  1. Reviewers 1 (R1) and 3 (R3) pointed out the lack of clarity regarding the 16-frame input to BabyNet and how predictions are made over all frames of a US video. The length of 16 frames was chosen during the hyperparameter optimization process. Each US video is divided into non-overlapping 16-frame segments, and a patient-level prediction is obtained by averaging all segment predictions (a sketch of this procedure is given after the reference below). The frames are resized to 64×64 to decrease computational complexity and reduce noise, as the quality of fetal videos acquired one day before delivery is poor due to the lack of amniotic fluid. We will describe this more thoroughly in the camera-ready version.
  2. R1 asked about the sweep mode of the fetal video. We use a sector scan of three fetal body parts (head, abdomen and femur).
  3. Regarding the remark by R3 about the division into patches before passing the feature maps into the MHSA layer: we do not use patches in our MHSA. We follow the original implementation of BoT [1], where queries, keys and values are embedded using 1×1 pointwise convolutions (illustrated in the attention sketch below).
  4. Reviewer 4 (R4) asked for clarification regarding the clinical significance of our solution. Fetal birth weight (FBW) is a significant indicator of perinatal health prognosis. Currently, FBW is estimated on the basis of fetal biometric measurements of body organs: head circumference (HC), biparietal diameter (BPD), abdominal circumference (AC), and femur length (FL), which are used as input to heuristic formulae (one such formula is illustrated below). Unfortunately, such an approach is prone to high error; with BabyNet, this error can be decreased. We will clarify this in the camera-ready paper.
  5. We initialize network weights with the default PyTorch strategy (the 3D convolutions were initialized with the Kaiming strategy).
  6. To prevent overfitting (R4's comment), we applied data augmentation (as described in the paper), a relatively small learning rate with step decay, and other parameters optimized using a grid search.
  7. As the loss function (R3), we use the Mean Squared Error (MSE), as mentioned in the Implementation details section. (Points 5-7 are illustrated together in the training-setup sketch below.)
  8. Future work (R4) includes testing BabyNet on external datasets, preferably acquired using different devices and by operators with different levels of experience. We also plan to use multimodal data, combining fetal US video with clinical data, to improve the performance of the model.

[1] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, “Bottleneck Transformers for Visual Recognition,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, Jun. 2021, pp. 16514–16524. doi: 10.1109/CVPR46437.2021.01625.
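To make point 1 concrete, here is a minimal sketch of the segment-level inference described above. It assumes a model that maps a (batch, channels, frames, height, width) tensor to one weight estimate per clip; the function name, the bilinear resizing mode, and all other details are our illustrative choices, not the authors' code.

    import torch
    import torch.nn.functional as F

    def predict_birth_weight(model, video, segment_len=16, size=64):
        # video: (T, 1, H, W) grayscale US frames for one patient.
        model.eval()
        # Resize every frame to 64x64 to cut computational cost and noise.
        video = F.interpolate(video, size=(size, size), mode="bilinear")
        # Keep only full, non-overlapping 16-frame segments.
        n = video.shape[0] // segment_len
        clips = video[: n * segment_len].view(n, segment_len, 1, size, size)
        clips = clips.permute(0, 2, 1, 3, 4)  # -> (n, C=1, T=16, 64, 64)
        with torch.no_grad():
            per_segment = model(clips).flatten()  # one estimate per segment
        return per_segment.mean().item()          # patient-level prediction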
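Point 3 can be illustrated with a simplified, single-head 2D version of BoT-style self-attention: every spatial position of a convolutional feature map acts as a token, and the query/key/value projections are 1×1 pointwise convolutions, so no patch embedding is needed. The sketch below omits BoT's relative position encodings and BabyNet's temporal extension.

    import torch
    import torch.nn as nn

    class PointwiseSelfAttention(nn.Module):
        # q, k and v come from 1x1 pointwise convolutions; no patches.
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Conv2d(dim, dim, kernel_size=1)
            self.k = nn.Conv2d(dim, dim, kernel_size=1)
            self.v = nn.Conv2d(dim, dim, kernel_size=1)
            self.scale = dim ** -0.5

        def forward(self, x):                         # x: (B, C, H, W)
            b, c, h, w = x.shape
            q = self.q(x).flatten(2).transpose(1, 2)  # (B, H*W, C)
            k = self.k(x).flatten(2)                  # (B, C, H*W)
            v = self.v(x).flatten(2).transpose(1, 2)  # (B, H*W, C)
            attn = torch.softmax(q @ k * self.scale, dim=-1)
            out = attn @ v                            # (B, H*W, C)
            return out.transpose(1, 2).reshape(b, c, h, w)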
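As background for point 4, one widely used heuristic of the kind the authors mention is the Hadlock (1985) HC/AC/FL formula. We show it purely as an illustration of the biometry-based approach, not as the specific baseline evaluated in the paper.

    def hadlock_efw(hc, ac, fl):
        # Estimated fetal weight in grams; HC, AC, FL in centimeters
        # (Hadlock et al., 1985).
        log10_efw = (1.326 - 0.00326 * ac * fl
                     + 0.0107 * hc + 0.0438 * ac + 0.158 * fl)
        return 10 ** log10_efw

    # Example: HC = 34 cm, AC = 35 cm, FL = 7.3 cm -> roughly 3.5 kg.
    print(round(hadlock_efw(hc=34.0, ac=35.0, fl=7.3)))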
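Finally, points 5-7 describe a fairly standard PyTorch training setup. A minimal sketch with a tiny stand-in model and placeholder hyperparameter values (the actual values were found by grid search, per point 6):

    import torch
    import torch.nn as nn

    def init_weights(m):
        # Kaiming initialization for 3D convolutions (point 5); all other
        # layers keep PyTorch's default initialization.
        if isinstance(m, nn.Conv3d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    # Tiny stand-in for BabyNet: (B, 1, T, H, W) -> one weight per clip.
    model = nn.Sequential(
        nn.Conv3d(1, 8, kernel_size=3), nn.ReLU(),
        nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 1),
    )
    model.apply(init_weights)

    criterion = nn.MSELoss()  # MSE regression loss on birth weight (point 7)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR (placeholder)
    # Step decay (point 6): scale the LR by 0.1 every 20 epochs (placeholder).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)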


