
Authors

Shaotong Zhu, Michael Wan, Elaheh Hatamimajoumerd, Kashish Jain, Samuel Zlota, Cholpady Vikram Kamath, Cassandra B. Rowan, Emma C. Grace, Matthew S. Goodwin, Marie J. Hayes, Rebecca A. Schwartz-Mette, Emily Zimmerman, Sarah Ostadabbas

Abstract

We present an end-to-end computer vision pipeline to detect non-nutritive sucking (NNS)—an infant sucking pattern with no nutrition delivered—as a potential biomarker for developmental delays, using off-the-shelf baby monitor video footage. One barrier to clinical (or algorithmic) assessment of NNS stems from its sparsity, requiring experts to wade through hours of footage to find minutes of relevant activity. Our NNS activity segmentation algorithm solves this problem by identifying periods of NNS with high certainty—up to 94.0% average precision and 84.9% average recall across 30 heterogeneous 60 s clips, drawn from our manually annotated NNS clinical in-crib dataset of 183 hours of overnight baby monitor footage from 19 infants. Our method is based on an underlying NNS action recognition algorithm, which uses spatiotemporal deep learning networks and infant-specific pose estimation, achieving 94.9% accuracy in binary classification of 960 2.5 s balanced NNS vs. non-NNS clips. Tested on our second, independent, and public NNS in-the-wild dataset, NNS recognition classification reaches 92.3% accuracy, and NNS segmentation achieves 90.8% precision and 84.2% recall.

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43895-0_55

SharedIt: https://rdcu.be/dnwzn

Link to the code repository

https://github.com/ostadabbas/NNS-Detection-and-Segmentation

Link to the dataset(s)

https://github.com/ostadabbas/NNS-Detection-and-Segmentation


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work the authors propose an end-to-end pipeline to detect non-nutritive sucking (NNS) in infants. They propose a system for this task, combining optical flow and pose tracking, trained and evaluated on a private dataset and a small public one.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The problem of NNS is very interesting and novel in the MICCAI community
    • The analysis of sucking events, e.g. their frequency (2 Hz), is interesting and matches the model setup
    • The small public dataset of NNS clips helps strengthen interest in this topic in the community
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • No methodological novelty: ResNet + temporal model (LSTM, Bi-LSTM, transformer).
    • Limited comparison and exploration of state-of-the-art activity recognition methods such as X3D or I3D, which are capable of encoding motion from RGB directly.
    • The dataset is hidden, and the public part is really small and suffers from a domain shift relative to the hidden part.
    • The dataset and evaluation are flawed, as the evaluation is done on a balanced dataset of an unbalanced event. A test set with even longer videos (e.g., 1 h) or clips without any NNS but only non-NNS could be used to address this.
    • The comparison of the window modes (tiled, sliding, and smoothed) does not add deeper insights.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance
    • training dataset is hidden.
    • code will be made public.
    • small portion of data will be made public.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html
    • The method figure is not very easy to understand. The sliding window and N-frame overlap are not adequately visualised and are confusing. The terms “Low-dimensional Representation Module” and “Dynamic Event Classification” are not used in the text; the “Frame-based Preprocessing Module” appears there only as the “Preprocessing module”.
    • The dataset is balanced even though NNS and non-NNS are imbalanced events. Both the 2.5 s and 60 s clips were balanced in both datasets. However, in the actual scenario non-NNS events will appear far more often, which is not represented in the dataset. This is not a problem for training but can be for evaluation. I suggest evaluating the segmentation on full recordings, e.g., 1 h or the entire sleeping time of the infant, as this is the actual use case.
    • In the ablation, optical flow and RGB are compared, and the result for RGB, Transformer, in-the-wild (48.9) is even lower than random guessing (on a balanced dataset that would be 50% accuracy). To me this suggests that no learning was possible for the CNN in this setup. The differences between RGB, optical flow, and two-stream inputs have been discussed before (“Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset”, João Carreira and Andrew Zisserman, 2017). In general the difference between RGB and flow shouldn’t be this big; researchers are actually favouring RGB-only networks for action classification, achieving excellent results (X3D, Feichtenhofer).
    • The ablation on the different windowing methods adds only little further insight. There are different ways to deal with this, as the authors propose, weighing efficiency (larger windows, no overlap) against accuracy, but in this scenario it is really hard to determine the best setup.
    • The combination of a 2D CNN + temporal modeling has the disadvantage that it has no built-in way to analyze motion directly. To understand motion directly in the network, 3D models such as X3D (a 3D CNN) have been used, as they can learn to compare information at the pixel level between different time steps. A comparison to 3D models (CNN or Transformer) would have been very insightful and might produce better results than CNN + temporal model. Optical flow encodes motion explicitly; hence the analysis of a single flow frame already includes at least 2 RGB frames.
    • One of the tags of this paper is “Temporal Convolution”. However, no temporal convolutions are used in this work. The combination of CNN + LSTM (or Transformer) is not a temporal convolution (“sequence to a temporal convolution network for spatiotemporal processing”).
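The tiled vs. sliding/smoothed distinction debated above can be sketched in a few lines. This is a hypothetical toy sketch, not the authors' implementation: per-window scores are spread back onto frames, where a stride equal to the window length gives the non-overlapping "tiled" mode, and a smaller stride averages overlapping windows, i.e. a smoothed sliding window.

```python
import numpy as np

def framewise_scores(window_scores, win, stride, n_frames):
    """Spread per-window scores onto frames by averaging all windows covering each frame.

    stride == win  -> 'tiled' mode (no overlap)
    stride <  win  -> overlap-averaged, i.e. smoothed sliding window
    """
    total = np.zeros(n_frames)
    count = np.zeros(n_frames)
    for i, s in enumerate(window_scores):
        start = i * stride
        total[start:start + win] += s
        count[start:start + win] += 1
    return total / np.maximum(count, 1)

# Four windows of length 4 with stride 2 over 10 frames.
scores = framewise_scores([1.0, 0.0, 1.0, 0.0], win=4, stride=2, n_frames=10)
```

Frames covered by a single window keep its score, while frames under overlapping windows get the average, which is one simple way of trading window size and overlap against temporal resolution.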
  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    3

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In this paper a binary action classification task is proposed for an interesting clinical problem, NNS detection. The methodology is not novel, and I miss a comparison with more recent architectures that model motion directly, e.g., 3D CNNs. The dataset is captured in a balanced way for a task that is highly unbalanced; the comparison therefore has limited clinical value, as testing should reflect the actual scenario.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    4

  • [Post rebuttal] Please justify your decision

    The authors addressed many crucial points in the rebuttal. The application is interesting, but there is only limited novelty in the methodology, and the evaluation could be stronger. The dataset is the biggest contribution of this work, but it is really small and might be unsuitable for a definitive answer as to whether NNS can be understood better by the proposed pipeline.



Review #3

  • Please describe the contribution of the paper

    In this manuscript, the authors describe an end-to-end computer vision pipeline to detect non-nutritive sucking (NNS). Using off-the-shelf baby monitors, they detect NNS, a potential biomarker for developmental delays. Since NNS events are sparse, identifying them manually requires multiple hours of work for only a few events.
    The study includes recording and labeling a dataset as well as developing the end-to-end algorithm.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Addresses the important task of following non-nutritive sucking; creates two new datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems like a loose fit to the topics of MICCAI. Activity classification and action segmentation are well-known tasks in the general video understanding community and in surgical data analysis. However, the authors did not mention these studies, and it seems that they started from scratch with their network selection. For example, many studies use I3D as a backbone network, and this network includes both image data and optical flow. It isn’t always the best option, yet it should be mentioned as a reference point.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    From what I understand, the main dataset they created will not be public. Since it includes the faces of babies recorded by the research team, it is understandable that it will not be made public. Yet this will make it harder to compare against the results presented by this study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    From what I understood, you use 6 videos. That is 6x2x80 ?= 1600; what am I missing? It was not mentioned why only 6 videos were used. Four models were used; are they pretrained?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Multiple studies have addressed action classification and detection, yet the authors did not refer to them. To the best of my understanding, they chose their own networks without justifying or claiming that these are a better fit for the current work. This leaves the central novel aspect of this work in the clinical domain of NNS. While this seems like an important task, I am not sure it is closely related to the MICCAI domain.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    5

  • [Post rebuttal] Please justify your decision

    I had two main issues with the manuscript. First, the topic of NNS seemed to me like a loose fit for MICCAI. Since both of the other reviewers thought it was a good fit, I’ll agree with them. Second, the authors performed activity recognition. My issue was that in their writing the authors gave the feeling that they were starting from scratch, not using previous networks developed for this domain. In the rebuttal they explained that they did test other approaches; however, these were not successful at the task. I think this is an important point that must be mentioned in the paper, since it might give the work broader impact. They mention they think that, since only a small part of the image moves (the mouth), perhaps different models are needed. Assuming they will address these issues in the updated version, I will change my overall opinion to weak reject.



Review #4

  • Please describe the contribution of the paper

    The authors propose an action classification algorithm for a long-tailed NNS problem and introduce a novel dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors mentioned their code and dataset will be public, which I believe will advance the field of NNS action recognition.
    2. The proposal handles a real-world action recognition application with a Conv-LSTM.
    3. A face tracker and LK optical flow are introduced to crop smoothed face regions, which helps the subsequent training process.
    4. The details of proposed dataset are well-introduced.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors applied LK optical flow in their application. Recently developed DNN-based optical flow estimation methods such as RAFT (Teed, Zachary, and Jia Deng. “RAFT: Recurrent All-Pairs Field Transforms for Optical Flow.” Computer Vision–ECCV 2020) can be faster than traditional optical flow methods and might be worth a try.
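For context, the classical Lucas-Kanade estimator under discussion boils down to solving a 2x2 least-squares system per window. A minimal single-window NumPy sketch (toy code, not the paper's implementation) is:

```python
import numpy as np

def lucas_kanade_window(prev, curr):
    """Estimate a single (u, v) translation for one window via Lucas-Kanade."""
    # Spatial gradients from the average of both frames, temporal difference between them.
    avg = (prev + curr) / 2.0
    Iy, Ix = np.gradient(avg)          # np.gradient returns (d/dy, d/dx)
    It = curr - prev
    # Least-squares solution of the brightness-constancy constraint Ix*u + Iy*v + It = 0.
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)       # (u, v) in pixels per frame

# Toy check: a smooth blob shifted by one pixel in x should yield u near 1, v near 0.
y, x = np.mgrid[0:32, 0:32]
blob = np.exp(-((x - 16) ** 2 + (y - 16) ** 2) / (2 * 4.0 ** 2))
shifted = np.roll(blob, 1, axis=1)     # move +1 pixel along x
u, v = lucas_kanade_window(blob, shifted)
```

A learned method such as RAFT replaces this local linearized solve with iterative correlation-volume updates, which is where the speed and robustness gains the reviewer mentions come from.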

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors mentioned they would release their code and a part of the datasets. I have no further worries about reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Since this paper focuses more on the application side, it would be better for the authors to say more about the medical background of NNS. For example, what information can we obtain if we have an accurate NNS segmentation, and how does this information relate to SIDS?

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    6

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors develop the first infant video datasets for NNS and will make them public. The paper also introduces a pipeline for preprocessing these infant videos.

  • Reviewer confidence

    Somewhat confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper is well-written and focuses on a very interesting and important task of detecting non-nutritive sucking, while introducing two new datasets. However, more discussion of related works is necessary (such as I3D), since the tasks of activity classification and action segmentation are widely used in the medical imaging community for different tasks. In the same vein, additional justification is needed about the methodology used and why it is the best fit for this task. Furthermore, comparison with methods such as X3D or I3D which are capable of encoding motion from RGB directly would be able to showcase the strengths of the proposed methodology. Finally, it would strengthen the results if evaluation of the method was provided on full clips of the entire sleeping time of an infant.




Author Feedback

We thank the reviewers for their careful reading and thoughtful comments.

  1. NNS dataset size, domain shift, access. R1 points out that our main dataset is private, the public dataset is small, and there is a domain shift between them. R3 asks why our main dataset only has 6 subjects and about the number of clips. Our main dataset was developed by an interdisciplinary team of clinical psychologists and ML engineers, with IRB-approved recruitment of 25 candidates over one year and hundreds of hours of scientifically informed annotation. Due to uncontrollable infant behavior, only 6 infants performed enough NNS with pacifier use (see Tbl S1) for ML purposes, providing 6x2x80 = 960 clips in total; the misreported 1600 figure was the size before selection and balancing. We think this is a useful first ML contribution to the important study of NNS as a neurodevelopmental signal. We developed the smaller public dataset to verify our method’s robustness to potential domain shifts caused by camera and environmental changes, so the presence of such a shift, together with the fact that our method evaluates well on both datasets, is a positive point. We are also in the process of collecting more NNS data, this time with full IRB permission for public release.

  2. NNS segmentation dataset balance, length. R1 argues that our NNS test “dataset and evaluation are flawed as the evaluation is done on a balanced dataset of an unbalanced event,” and that it would be better to test on hours-long videos with sparser NNS. Our NNS in-crib test set has a 71:29 balance of non-NNS:NNS duration, and our precision-recall-based evaluation helps ensure the integrity of our test under class imbalance. It would be more informative to test on longer videos, but while our hundreds of hours of behavioral coding yielded high interrater reliability at the ~10 s level (Tbl S1), properly testing precision at the ~1 s level required further selection for higher precision, as described in Sec. 4.1. Because NNS is a rare event, we needed to overrepresent it relative to its natural occurrence in order to ensure meaningful statistical results. We will clarify these issues in our revision.
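To illustrate why precision and recall remain informative under such imbalance, consider a toy frame-level example mirroring the reported 71:29 duration balance (hypothetical numbers, not the paper's data):

```python
# Toy frame-level labels: 71 non-NNS frames, 29 NNS frames (1 = NNS).
truth = [0] * 71 + [1] * 29
# Hypothetical detector: fires on all 29 NNS frames plus 10 false alarms.
pred = [0] * 61 + [1] * 10 + [1] * 29

tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))   # true positives
fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))   # false positives
fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))   # false negatives

precision = tp / (tp + fp)   # 29 / 39: penalized by false alarms on the majority class
recall    = tp / (tp + fn)   # 29 / 29 = 1.0
```

Unlike plain accuracy, precision drops with every false alarm on the abundant non-NNS frames, so it cannot be inflated merely because the negative class dominates.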

  3. Method: Prior work. R3 argues that we do not mention other work on action segmentation and classification and instead build from scratch; they ask whether our model is pretrained. We do draw inspiration from existing work for our spatiotemporal network design but limited our discussion due to lack of space, as many of the methods are well established. We will rectify this in the revision. Using a pretrained ResNet and training the LSTM from scratch yielded better results, possibly due to the unique characteristics of NNS movements.

  4. Method: RGB, optical flow (OF), I3D. R3 suggests comparing with I3D, which combines RGB and OF features; R4 suggests the RAFT DNN-based OF method; and R1 adds that our results should not have shown a big improvement from OF over RGB features. In addition to our formally reported OF vs. RGB tests (Tbl 1), we did internal testing with a number of OF pipelines (including Farneback, TV-L1, and RAFT) and a number of fusion pipelines (including fine-tuning I3D and X3D on our NNS dataset), and found them ineffective in both qualitative and quantitative results. This could be due to the smaller training set and/or the difficulty of detecting NNS from movements of a small number of pixels near the mouth, versus the large-scale limb movements in other action datasets.

  5. Method: Spatiotemporal networks. R1 suggests that we test with 3D CNN models like X3D to capture motion better and that we should not use the term “temporal convolution” for our CNN + LSTM; R3 suggests testing with other action segmentation networks. We tested with 3D CNNs and found that they could not learn, possibly due to their large number of parameters (~10x more), the relatively small training dataset (~5x smaller than 50 Salads), and the subtlety and uniqueness of NNS movements. We will fix the terminology.
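The parameter-count argument in item 5 can be made concrete with layer-level arithmetic for vanilla (non-depthwise) convolutions. The numbers below are illustrative only; the actual X3D and CNN + LSTM models differ in many other ways:

```python
def conv2d_params(c_in, c_out, k):
    # k x k spatial kernel per input/output channel pair, plus one bias per output channel
    return c_in * c_out * k * k + c_out

def conv3d_params(c_in, c_out, k):
    # k x k x k spatiotemporal kernel, plus bias
    return c_in * c_out * k * k * k + c_out

p2 = conv2d_params(64, 64, 3)    # one plain 2D conv layer
p3 = conv3d_params(64, 64, 3)    # the same layer inflated to 3D
ratio = p3 / p2                  # roughly k = 3 per layer
```

Inflating a k x k kernel to k x k x k multiplies its weights by k (about 3x per layer for k = 3), and a deeper 3D stack with fewer pooling stages compounds this toward the order-of-magnitude gap the rebuttal mentions.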




Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The paper is interesting and well-written. The application is novel and interesting for the MICCAI community. The authors addressed critical points in the rebuttal, and the paper can be substantially improved with minor corrections for the camera-ready version, specifically extending the related work to include I3D and the authors’ previous efforts that were ineffective in qualitative and quantitative results, potentially due to the limited dataset size.



Meta-review #2

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    Strengths: new application and new dataset. Weaknesses: limited technical novelty; the evaluation could be more complete; the dataset is relatively small; and the problem of non-nutritive sucking action detection is relatively narrow and may not find interest with the majority of the MICCAI audience.

    The rebuttal provides some info, but does not influence my decision here in a major way. Perhaps this is due to the rather rigid length limitation.



Meta-review #3

  • Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

    The Meta-R appreciates the efforts made to address many crucial points in the rebuttal. However, some critical concerns still remain, such as the limited method novelty, insufficient evaluation, and small-scale dataset. Therefore, I recommend rejection.


