
Authors

He Zhao, Qingqing Zheng, Clare Teng, Robail Yasrab, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

Abstract

Video quality assurance is an important topic in obstetric ultrasound imaging to ensure that captured videos are suitable for biometry and fetal health assessment. Previously, one successful objective approach to automated ultrasound image quality assurance has considered it as a supervised learning task of detecting anatomical structures defined by a clinical protocol. In this paper, we propose an alternative and purely data-driven approach that makes effective use of both spatial and temporal information. The model learns from high-quality videos without any anatomy-specific annotations, which makes it attractive for potentially scalable generalisation. In the proposed model, a 3D encoder and decoder pair bi-directionally learns a spatio-temporal representation between the video space and the feature space. A zoom-in module is introduced to encourage the model to focus on the main object in a frame. A further design novelty is the introduction of two additional modalities in model training (sonographer gaze and optical flow derived from the video). Finally, our approach is applied to identify high-quality videos for fetal head circumference measurement in freehand second-trimester ultrasound scans. Extensive experiments are conducted, and the results demonstrate the effectiveness of our approach with an AUC of 0.911.

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16440-8_22

SharedIt: https://rdcu.be/cVRvN

Link to the code repository

https://github.com/IBMEOX/UltrasoundVQA

Link to the dataset(s)

N/A


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors developed a bi-directional model with a 3D encoder and decoder pair that identifies ultrasound images of the fetal head that are suitable for making measurements. The model does this by learning a spatio-temporal representation between the video space and the feature space.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model appears novel and the method may make the measurements of the fetus more reproducible and less dependent on the operator. The analysis appears sound and the results are convincing.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some information about the dataset is missing. Statistical analysis of the results is missing.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Appears to be complete

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    It is not clear what makes the three types of images low quality. Is it because they are the wrong planes for making the measurements?

    It is not clear how you partitioned the images in the training and testing datasets, i.e., did images from a given subject appear in both the training and testing dataset?

    Figure 2: Can you add the meaning of the column labels to the caption?

    Figure 2: It is not clear from this figure what makes the images high-quality or low-quality. Perhaps a short description in the caption would help. You described the difference between the TVP and TCP, but that is not obvious. Perhaps arrows on the images could relate them to the description.

    Was the study performed with Institutional Ethical Review Board approval?

    Since the labeling of the images is dependent on the frozen frame, who did that?

    You had a total of 611 video clips. Was each clip from a different subject, i.e., did you image 611 fetuses?

    Tables and Figure 3: The differences between the reported performances are clear, but it is not clear whether they are statistically significant.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The usefulness of the model, which makes fetal ultrasound more reproducible and less operator dependent.

  • Number of papers in your stack

    4

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Somewhat Confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    6

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments, but the paper is still not a “strong accept”.



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors describe an objective task-based approach to the quality assessment of clinical ultrasound video, which relies on learning a latent spatio-temporal representation from two modalities (ultrasound video and optical flow). A third modality is also included, i.e. gaze, which helps the model focus on regions of interest in high-quality videos. The proposed model aims at automatically assessing the diagnostic quality of fetal ultrasound exams. Videos containing the transventricular plane (TVP) were considered as the high-quality references for data-driven/unsupervised learning. The method uses the feature reconstruction error to discriminate between low and high-quality videos.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Research motivations and its clinical importance are clearly stated
    • Scalability of the method
    • The method does not require anatomical annotations for training, which is an important improvement over state-of-the-art methods
    • A strong validation of the results is provided, with extensive result comparison and ablation studies, which support the proposed methodology
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Some important details about the proposed model, e.g. parameters, and the used dataset are not given
    • Code/datasets were not provided. Thus, given the missing information above, reproducibility of these results is hindered
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    Although the authors state that the code is provided, this is not the case. The used dataset is not public and thus is not provided. However, the authors should provide a bit more information on image acquisition/origin.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
    • In section 2, the authors state: “The definition of quality assessment in ultrasound is different in that it needs to factor in clinical context”. This is not entirely true, as quality assessment for ultrasound content may also focus on image clarity and definition, for example, to provide insight into equipment/technology development. Consider rewriting this sentence to accommodate this.

    • The final sentence in section 3 (before 3.1) and the one immediately before the “Spatial zoom-in module” subsection could be written just once. They are repeating the same idea. Also, in section 4, the second and third sentences could be condensed into one.

    • The parameters of the Farneback algorithm and the median filter should be described. The same applies to the fully connected layers in D_v.

    • In section 4, the loss weights should be defined as w_adv, w_rec, and w_gaze, following the notation in Eq. (1).

    • How exactly was the image-based approach implemented? Was this done with the model proposed by the authors? If that is the case, it is not clear how a single image input would be processed. This should be further explained.

    – The paper needs to be thoroughly proofread. Some minor writing issues include:

    • Abstract, line 8: end sentence after “temporal information”. Begin new sentence with “The model…”
    • Abstract, line 9: “anatomy-specific annotations, which makes…”
    • Introduction, line 2: “free radiation” -> “acquisition process, which does not use radiation”.
    • Introduction, line 9: add comma after “labour-intensive”
    • Page 2, line 1: “spaces”
    • Page 2, line 3: add comma after “error”
    • Section 2, line 2: add comma after “proposed”
    • Section 2, line 15: “considers”
    • Section 2, line 16: add comma after “gain)”
    • Section 2, line 22: “checks (if) images”
    • Section 2, line 23: add comma after “[16]”
    • Page 2, last line: “limit” -> “limits”
    • Consider renaming section 3 as “Methods” or “Methodology”
    • Subsection “Spatial zoom-in module”, line 5: “on (the) overall”
    • Subsection “Bi-directional reconstruction”, last sentence: “The structure of the discriminator DV is similar to that of encoder”
    • Page 6, line 8: “exemplar” -> “example”
    • Subsection “Quantitative results”, line 5: add comma before “thus”
    • Avoid repetitions such as “clinical quality for clinical tasks”
    • Page 6, last two lines: add comma after “single-modality video reconstruction”
    • Subsection “Ablation study”, line 4: “achieved by (the) inclusion”
    • Subsection “Gaze prediction”, line 2: remove “can”
    • Subsection “Gaze prediction”, line 7: “approximate” -> “approximately”
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors propose an interesting method, which achieved good results in this particular case and has great potential for more widespread application across different medical imaging modalities. The overall score suffered from the lack of clarity on some important details.

  • Number of papers in your stack

    3

  • What is the ranking of this paper in your review stack?

    1

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered



Review #3

  • Please describe the contribution of the paper

    The authors present an unsupervised approach for the quality assessment of fetal head ultrasound images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well written and easy to follow. The proposed approach for unsupervised training seems interesting. The experimental results are acceptable.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The size of the training dataset is quite small. Details regarding the dataset are missing. It is not clear how robust the method is to input domain shifts. The clinical application of the proposed method seems to be minor.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    As the method includes many details and the model has several parts, the results would only be reproducible if the code and data were available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

    There are two major issues with the current submission: 1) The size of the dataset is very small. While the details of the dataset and imaging settings have not been given, I think only 300 video clips for training a 3D model with several components is quite insufficient. Please explain the details of your stopping criterion, because your method includes adversarial training. In Fig. 1 (b), it is also unclear how the label is created to calculate the quantitative indexes. Please also explain whether cross-validation and data augmentation are used in the training step. 2) The clinical application of the proposed method seems to be minor because the method is computationally expensive. I would think that clinicians can easily look at the images to see whether the TVP is present or not; it is neither hard to distinguish nor user-dependent, and the user does not even need to be very experienced. I would think the task is not challenging enough.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The robustness of the proposed method, as well as its clinical application, is questionable.

  • Number of papers in your stack

    5

  • What is the ranking of this paper in your review stack?

    2

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Not Answered

  • [Post rebuttal] Please justify your decision

    Not Answered




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper describes an automatic method to assess the quality of ultrasound videos destined for biometry. The approach relies on spatio-temporal modeling, an unsupervised reconstruction task, and an adversarial loss. Video and optical flow, and more originally gaze, are used to guide the training process. Experimental validation was deemed convincing (R1), strong (R2), and acceptable (R3). Other advantages raised by the reviewers include the low annotation requirements, scalability, and applicability to other modalities/tasks.

    Points to address in the rebuttal and the revised version are:
    - Clarify the methodological novelty of the paper (R1 states the method “seems novel” and R2 that it is “interesting”, but no major novelty has been underlined).
    - Describe the dataset in more detail (R1, R3): ethical review board approval, number of subjects, split, frozen-frame collection, acquisition, origin, etc.
    - What do learning curves say about the learnability of the proposed 3D model from the studied dataset? Overfitting? Variability? (R3)
    - Are the performance differences significant? (R1)
    - Although R1 and R2 find the clinical motivation adequate, R3 argues it is not sufficiently clear. Discuss.
    - R1 also mentions that the actual definition of high- and low-quality videos is not clear. Clarify.
    - Is code going to be provided?
    - Include the implementation and optimization details (R2, R3): hyperparameters, stopping criteria, augmentation, cross-validation, etc.
    - Describe how data (a single frame? or a video?) is processed at inference time.

  • What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    5




Author Feedback

We thank the reviewers for judging our paper “novel”, “clear”, and “convincing”, and for their constructive criticism. We address the major concerns as follows.

Q1: Clarify the methodological novelty (M-1) A: Our method is the first unsupervised, video-based clinical quality assessment method, which is a significant improvement (R2) since no anatomical annotation is needed. A bi-directional, reconstruction-based anomaly detection pipeline is proposed for the first time to learn an informative spatio-temporal representation between the video and feature spaces. In addition, two further modalities (gaze and optical flow) are integrated via auxiliary modules to improve performance.
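
As an illustration of the anomaly-detection idea described above (not the authors' exact formulation): the following minimal PyTorch-style sketch scores a clip by its feature-space reconstruction error, which Review #2 identifies as the discriminating signal. The `encoder`/`decoder` callables are hypothetical placeholders for the paper's 3D networks, and the exact scoring function used in the paper may differ.

```python
import torch

def quality_score(video, encoder, decoder):
    # video: tensor of shape (1, C, T, H, W); encoder/decoder stand in for
    # the paper's 3D networks (hypothetical placeholders).
    with torch.no_grad():
        feat = encoder(video)        # video space -> feature space
        recon = decoder(feat)        # feature space -> video space
        feat_rec = encoder(recon)    # re-encode the reconstruction
        # Trained only on high-quality (TVP) clips, the model should
        # reconstruct them well; a large feature reconstruction error
        # therefore flags a low-quality clip.
        err = torch.mean((feat - feat_rec) ** 2)
    return err.item()
```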

Q2: The definition of clinical video quality (R1;M-6), clinical motivation (R3;M-5) A: High-quality video clips contain good planes for a specific measurement (TVP), whilst low-quality videos do not. The automated quality assessment task has been studied in recent works [7,8,10,14]. Clinically, it is important to ensure that an acquired video is fit for diagnostic decision making. Automated QA has the potential to improve the clinical workflow, since users do not need to manually check quality, and to support trainees in scanning. Our approach alleviates the time-consuming quality check and makes the measurements less dependent on the operator. Furthermore, it offers a general approach to assessing clinical video quality for other tasks.

Q3: Details of dataset (R1;R3;M-2), data preparation (M-9) A: This study is approved by an Ethical Review Board. The full-length obstetric videos are recorded at 30 Hz by a GE Voluson E8 scanner, and gaze is recorded simultaneously by a Tobii Eye Tracker 4C. In total, 430 subjects are included in our dataset, with a video resolution of 1008x784. During a scan, an experienced sonographer finds and freezes a biometry plane. Each video clip is labeled by the frozen frame type, e.g., TVP, TCP, ACP. An aTVP video clip is collected 5-7 s before the frozen TVP frame. We collect 430 high-quality TVP video clips (one clip per subject) and 181 low-quality clips. We use the frozen frame and the 2 s before freezing for training and testing. For training, 300 HQ (TVP) video clips are randomly selected; the remaining 130 HQ and 181 LQ clips are used for testing. Each input sample consists of 8 frames sampled from a 2 s video clip at an 8-frame interval, further resized to 256x256, which is the “raw video” shown in Fig. 1.
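
For concreteness, a minimal sketch of the frame sampling and resizing just described (assuming OpenCV and NumPy; the function name and defaults are illustrative, not taken from the paper's code):

```python
import cv2
import numpy as np

def build_input_sample(frames, step=8, n_frames=8, size=(256, 256)):
    # frames: grayscale frames from the 2 s clip before freezing
    # (~60 frames at 30 Hz); sampling every 8th frame yields ~8 frames.
    idx = np.arange(0, len(frames), step)[:n_frames]
    sampled = [cv2.resize(frames[i], size) for i in idx]
    return np.stack(sampled)  # shape: (8, 256, 256), the "raw video" input
```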

Q4: Implementation details (R2;R3;M-3,8) A: The hyperparameters for the loss weights are as follows: w_adv, w_rec, and w_gaze are 1, 10, and 0.1, respectively (mistakenly typeset as L_* in the paper). The bottleneck feature size is 1024, and D_F consists of 6 FC layers with the number of neurons ranging from 64 to 1. We choose a window size of 3 for the Farneback optical flow algorithm and a kernel size of 21 for the median filter to reduce the effect of speckle when generating the flow map. The train/test samples are randomly split, and we train our model 5 times instead of using cross-validation, without any data augmentation. Following CycleGAN, the model is trained for 200 epochs; the learning rate is set to 0.0002 and linearly decays to 0 over the last 100 epochs. The code will be released once the paper is accepted. The comparative image-based approach is a degraded variant of our approach that takes only the frozen frame as input and uses a similar, but 2D, architecture.
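
For reference, Eq. (1) presumably combines the losses as the weighted sum L = w_adv*L_adv + w_rec*L_rec + w_gaze*L_gaze with the weights above. The sketch below (assuming OpenCV and SciPy) illustrates flow-map generation with the stated window size (3) and median-filter kernel (21); the remaining Farneback parameters are common defaults, not values from the paper.

```python
import cv2
from scipy.ndimage import median_filter

def flow_map(prev_gray, next_gray, winsize=3, ksize=21):
    # Farneback dense optical flow; positional args after `None` are
    # pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    # Only winsize=3 is taken from the rebuttal; the rest are defaults.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None, 0.5, 3, winsize, 3, 5, 1.2, 0)
    # Median-filter each flow channel (x and y) to suppress speckle noise.
    flow = median_filter(flow, size=(ksize, ksize, 1))
    return flow  # shape: (H, W, 2)
```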

Q5: Model robustness (R3;M-7), performance significance (R1;M-4) A: We have run our model 5 times and report the average performance (mean ± std in Table 1). Training stops once the loss stops decreasing (our stopping criterion), and the perturbation test on the test set (average AUC of 0.906 over flip/rotate/contrast/noise perturbations, vs. 0.911 without perturbation) indicates no overfitting. These results demonstrate the robustness of our model on our moderately sized dataset. In addition, we conduct a paired t-test between our method and [17]. The p-value is 8e-5 << 0.05, indicating a statistically significant benefit of our approach.
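
A minimal example of the paired t-test mentioned above (assuming SciPy; the AUC values are illustrative placeholders rather than the numbers behind the reported p-value, and the pairing is assumed to be over the 5 repeated runs):

```python
from scipy.stats import ttest_rel

# Per-run AUCs for the proposed method and the baseline [17]
# (placeholder values for illustration only).
auc_ours     = [0.912, 0.908, 0.915, 0.909, 0.911]
auc_baseline = [0.861, 0.858, 0.866, 0.859, 0.863]

t_stat, p_value = ttest_rel(auc_ours, auc_baseline)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.1e}")
```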




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The rebuttal has answered most of the questions (novelty, robustness, performance significance, and dataset description). I support the acceptance of this paper given the clinical pertinence (automation of video quality evaluation is a realistic CAD application), the originality brought by the gaze data, and the quality of the experimental validation.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    I thank the authors for their effort in addressing the questions raised by the reviewers. I encourage the authors to incorporate their answers to Q3 (details of the dataset), Q4 (implementation details), and Q5 (robustness and statistical significance) into the final paper.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    4



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The authors have addressed the main questions in their rebuttal. Overall, the paper presentation seems clear, and the proposed method novel and generally applicable (unsupervised). The performed t-test shows significant improvements over the SOTA. For me, the work is convincing enough to be accepted and should be of interest to the MICCAI community.

  • After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

    Accept

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    3


