
Authors

Matthias Seibold, Armando Hoch, Mazda Farshad, Nassir Navab, Philipp Fürnstahl

Abstract

In this work, we propose a novel data augmentation method for clinical audio datasets based on a conditional Wasserstein Generative Adversarial Network with Gradient Penalty (cWGAN-GP), operating on log-mel spectrograms. To validate our method, we created a clinical audio dataset which was recorded in a real-world operating room during Total Hip Arthroplasty (THA) procedures and contains typical sounds which resemble the different phases of the intervention. We demonstrate the capability of the proposed method to generate realistic class-conditioned samples from the dataset distribution and show that training with the generated augmented samples outperforms classical audio augmentation methods in terms of classification performance. The performance was evaluated using a ResNet-18 classifier which shows a mean Macro F1-score improvement of 1.70% in a 5-fold cross validation experiment using the proposed augmentation method. Because clinical data is often expensive to acquire, the development of realistic and high-quality data augmentation methods is crucial to improve the robustness and generalization capabilities of learning-based algorithms, which is especially important for safety-critical medical applications. Therefore, the proposed data augmentation method is an important step towards alleviating the data bottleneck for clinical audio-based machine learning systems.
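For readers unfamiliar with the log-mel representation the cWGAN-GP operates on, a minimal NumPy sketch follows. It is not the authors' preprocessing: the 16 kHz sample rate, FFT size, hop length, number of mel bands, and the HTK-style mel formula are all illustrative assumptions. The 16380-sample input length is used only because L = 16380 is mentioned later in Review #5.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one of several conventions)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters mapping the n_fft//2+1 FFT bins onto n_mels mel bands."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)  # centre frequency of each rfft bin
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=64, eps=1e-10):
    """Return a (n_mels, n_frames) log-mel spectrogram; trailing samples are dropped."""
    if len(signal) < n_fft:
        raise ValueError("signal shorter than one FFT frame")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T        # (n_frames, n_mels)
    return np.log(mel + eps).T                               # (n_mels, n_frames)

rng = np.random.default_rng(0)
spec = log_mel_spectrogram(rng.standard_normal(16380), sr=16000)
print(spec.shape)  # (64, 60) with the assumed parameters
```

The resulting 2-D array is what a convolutional generator and critic would treat as a single-channel image; in practice a library such as librosa would typically be used instead of a hand-rolled filterbank.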

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_33

SharedIt: https://rdcu.be/cVRW8

Link to the code repository

https://rocs.balgrist.ch/en/open-access/

Link to the dataset(s)

https://rocs.balgrist.ch/en/open-access/

Reviews

Review #1

Please describe the contribution of the paper

The paper presents a GAN-based method for augmentation of clinical audio data by synthesizing log-mel spectrograms. Furthermore, the paper demonstrates that the augmented data can help to improve the classification of various clinical actions using a ResNet-18 classification model.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper presents a new direction. Overall, it is well organized and well presented, and the flow is easy to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

No comparison with other methods (do the authors claim there are none?). The clinical application is not very clear, so it is not clear what motivates the work.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

No code provided, but the authors have promised to make it public.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The authors have presented an interesting application of GANs to clinical audio data. However, the data in this work were collected through a controlled process and have been manually edited/segmented to extract audio segments for each action. In a real clinical setting, the audio data will be very different and will involve human conversations, ambient noise, etc. The authors did not provide any insights on this or on how it could be addressed.

Did the authors also consider using SSIM for comparison?

Check for a typo in the conclusion: “can to improve…”
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The topic is interesting. The approach is easy to follow and will encourage good debate in the conference.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

3
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

6
[Post rebuttal] Please justify your decision

I believe the authors have already explained the contribution. Other reviewers also pointed to limited novelty. However, the contribution of the dataset is clear, so I would recommend acceptance.

Review #4

Please describe the contribution of the paper

The paper introduces a new clinical audio dataset, which records the sounds of different surgical phases during Total Hip Arthroplasty (THA) procedures in a real-world operating room. A GAN-based model equipped with a Wasserstein loss and gradient penalty is proposed to enlarge the dataset. Based on the augmented dataset, the method shows an improvement on the phase classification task compared with other data augmentation strategies.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Introduces a new audio dataset recording the sounds of surgical actions
- Proposes to enlarge the dataset with a GAN
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- Limited methodological novelty, as the work simply applies a GAN to perform the augmentation
- The experiments are insufficient and can only be regarded as preliminary results for this new topic.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Good reproducibility, as the authors promise to release the dataset and code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
- The main weakness is the novelty of the proposed method. It seems to simply apply GAN techniques with a Wasserstein loss to generate clinical audio, without considering the unique properties of clinical audio when designing the method.
- As shown in Fig. 1, ‘Suction’ always seems to overlap with other phases. How do you define the recordings for the ‘Suction’ class? Do they also belong to other classes?
- How many samples per class do you generate for the comparison in Table 1? If you only double each class, why do you say that the generation can help tackle the class imbalance problem?
- It would be better to show the results for all five folds, rather than only the average results.
- From Fig. 3, it seems that the generated samples are not well aligned with the GT in each class. You could show samples generated by other GAN-based approaches for comparison, to validate the effectiveness of your method.
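The reviewer's class-imbalance question can be made concrete with a small sketch. It uses hypothetical class counts (not the paper's numbers) and contrasts two hypothetical policies for deciding how many synthetic samples to generate per class: doubling every class, which leaves the class proportions unchanged, versus topping every class up to the majority-class count.

```python
def doubling_plan(class_counts):
    """Generate one synthetic sample per real sample (augmentation factor 2)."""
    return {c: n for c, n in class_counts.items()}

def balancing_plan(class_counts):
    """Generate enough synthetic samples to bring every class up to the majority count."""
    target = max(class_counts.values())
    return {c: target - n for c, n in class_counts.items()}

def augmented_counts(class_counts, plan):
    """Real plus synthetic samples per class."""
    return {c: class_counts[c] + plan[c] for c in class_counts}

def class_ratios(counts):
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Hypothetical counts for illustration only
counts = {"majority": 100, "minority": 20}
print(class_ratios(augmented_counts(counts, doubling_plan(counts))))   # proportions unchanged
print(augmented_counts(counts, balancing_plan(counts)))                # {'majority': 100, 'minority': 100}
```

Only a policy like the second one changes the class distribution seen by the classifier, which is the distinction the reviewer is asking the authors to clarify.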
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

4
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Weak novelty of the proposed method, as well as insufficient experimental results.
Number of papers in your stack

6
What is the ranking of this paper in your review stack?

4
Reviewer confidence

Confident but not absolutely certain
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

4
[Post rebuttal] Please justify your decision

I have carefully read the response from the authors. However, I am still not convinced by the method's novelty. In their response, they also state that their method ‘combines recent advances in synthetic data generation …’. Additionally, regarding the experiments, two important components are neither mentioned nor added, i.e., the results for each fold and the comparison with other GAN-based methods. Therefore, I am leaning towards rejection.

Review #5

Please describe the contribution of the paper

The authors present a Wasserstein GAN-based augmentation method for audio-based classification tasks in medical applications. The authors show an improvement in the classification performance of a classifier trained with the GAN-augmented data.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors present a novel method for augmenting audio datasets in medical applications.
- The authors show improved performance over existing techniques
- The analysis of audio signals is a potentially rich source of insight in medicine that is not well researched.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
- The dataset analysed is very small, therefore the generalizability of the method is limited.
- The authors do not adequately describe and justify the use of the dataset being analyzed. What is the purpose of analyzing the THA audio dataset?
- The description of the results needs to be improved. Only mean accuracy is presented, which is very limiting in terms of interpretation.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors state that they plan to release the code upon publication. The methods are described well enough that a scientist with the code could reproduce the paper. The dataset is small and not well described, so reproducing these results may be challenging without the dataset.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html
Abstract
- The authors should make the primary use case of this approach clearer. Is it meant for A/V annotation of the surgical stage, to guide interventions, or for training? It is even unclear what kind of sounds will be analysed: the speech of the surgical team, the heartbeat of the patient, or the pitch of the drill? It is unclear what labelling the collected dataset in an automated fashion will be useful for.
Introduction
- The reason for analysing the THA procedure is unclear. The authors successfully make the case that there are many useful reasons to analyse audio in medical applications, such as drilling sounds as guidance in orthopedic procedures and lung sounds as a diagnostic measure. However, the authors need to point out why they are working with this particular audio dataset. It is not clear what this dataset is useful for or whether it is a good surrogate for other audio datasets.
Methods
- The dataset seems small (N=5), which would likely affect generalizability. Are the cross-validation folds split by recording, i.e., 5 folds for 5 different recordings?
- What is the sampling rate of the recordings? How many seconds does L=16380 correspond to?
- Why do the spectrogram distribution and the recording distribution differ? The percentage of suction samples in the spectrogram data is far smaller than in the recording dataset.
- How is the data split for training the GAN vs. training the classifier? This could be made clearer.
Figure 2
- A more thorough description in the caption is needed. It should be made clear in the description that this is the GAN. The input and the spectrogram output should be made clearer.
Results
- The results section should state the main result, rather than only refer to the figures and tables.
- Why is mean accuracy reported, rather than F1, for the multiclass classification? Some sample confusion matrices would be helpful in interpreting the results. From accuracy alone, it is unclear which classifier is best.
Discussion
- The authors should not comment on the method improving robustness or generalization because neither was investigated in this study.
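Reviewer #5 asks whether the cross-validation folds are split by recording. A minimal sketch of such a leave-one-recording-out split is shown below; the recording identifiers are hypothetical, and the reviews do not state which splitting scheme the paper actually used. Splitting by recording keeps segments from the same surgery out of both the training and the test fold, which avoids optimistic estimates from near-duplicate segments. If a GAN is trained per fold, restricting it to that fold's training indices would likewise avoid leakage into the test fold.

```python
from collections import defaultdict

def leave_one_recording_out(recording_ids):
    """Yield (held_out_recording, train_indices, test_indices), one fold per recording."""
    by_recording = defaultdict(list)
    for idx, rec in enumerate(recording_ids):
        by_recording[rec].append(idx)
    for held_out in sorted(by_recording):
        test_idx = by_recording[held_out]
        train_idx = [i for i, rec in enumerate(recording_ids) if rec != held_out]
        yield held_out, train_idx, test_idx

# Hypothetical recording ID for each spectrogram segment
ids = ["r1", "r1", "r2", "r3", "r3", "r3", "r4", "r5"]
for rec, train, test in leave_one_recording_out(ids):
    print(rec, len(train), len(test))
```

With five recordings this yields exactly five folds, matching the reviewer's "5 folds, 5 different recordings" reading; scikit-learn's `LeaveOneGroupOut` implements the same idea.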
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Novelty of the approach
- The dataset was small and not well justified
- The results are promising but could have been better described
Number of papers in your stack

7
What is the ranking of this paper in your review stack?

6
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

5
[Post rebuttal] Please justify your decision

The main area of novelty remains the augmentation of audio signals, an area that has not been well studied. The main weaknesses remain: limited data, limited experiments, limited justification, and preprocessing of the audio signals that limits generalizability.

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

The authors propose a clinical audio dataset that records the sounds of different surgical phases during Total Hip Arthroplasty. A GAN-based technique is used for data augmentation to improve classification performance. The reviewers have raised several concerns about the motivation, novelty, data description, and experiments. I invite the authors to submit a rebuttal addressing the reviewers' comments, highlighting the motivation for using audio data and the novelty of this work, and justifying why only limited experiments were done. Additionally, the reviewers highlighted that the dataset is very small. In such a scenario, how can generalisation be achieved? The use of audio data is not clear, and more experimental details should have been included.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

8

Author Feedback

Dear reviewers and editors,

Thank you for the positive and supportive (R1, R5) and constructive (R1, R4, R5) reviews and for the opportunity to provide this feedback.

First, we want to address the feedback of R4 regarding the technical novelty of the proposed work. As illustrated in the introduction section, deep learning-based analysis of acoustic signals in the medical domain is a growing research direction. Here, data limitations are very common in the clinical context, and the lack of high-quality data augmentation methods is an often-described problem for the application of state-of-the-art learning-based systems. As confirmed by R1 and R5, the technical novelty lies in the proposed augmentation method, which combines recent advances in synthetic data generation (conditional convolutional GAN), improved GAN losses (Wasserstein loss with gradient penalty), and state-of-the-art feature representations in learning-based clinical audio analysis (log-mel spectrograms). Through the application of WGAN with GP regularization and class conditioning, our method is able to generate high-quality and diverse synthetic samples for any given class. While we show that the proposed method outperforms established augmentations for clinical audio in prior work, it would definitely be valuable to also evaluate other GAN-based approaches for audio augmentation described outside of the clinical field. We added a sentence mentioning this limitation in the discussion section and adapted the respective paragraph in the introduction to better point out the technical novelty.
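The components listed above (Wasserstein loss, gradient penalty, class conditioning) correspond to the standard WGAN-GP objective. For reference, a class-conditional form is written below, with critic D, generator G, class label y, and noise vector z. How y is fed to D and G and the value of the penalty weight λ are assumptions here, not details confirmed on this page (λ = 10 is the default proposed in the original WGAN-GP work). The penalty is evaluated at random interpolates between real and generated spectrograms and pushes the critic's gradient norm towards 1, a soft version of the 1-Lipschitz constraint the Wasserstein formulation requires.

```latex
\begin{aligned}
\mathcal{L}_D &= \mathbb{E}_{z,\,y}\big[D(G(z, y), y)\big] - \mathbb{E}_{(x,\,y)}\big[D(x, y)\big]
  + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}, y) \rVert_2 - 1\big)^2\Big], \\
\hat{x} &= \epsilon\, x + (1 - \epsilon)\, G(z, y), \qquad \epsilon \sim \mathcal{U}[0, 1], \\
\mathcal{L}_G &= -\,\mathbb{E}_{z,\,y}\big[D(G(z, y), y)\big].
\end{aligned}
```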

For the evaluation, we report mean and standard deviation (std) of the per-class accuracies in cross validation (CV), which gives a good estimate of the multiclass classification performance. For better interpretability, we have included the weighted average F1 score over all folds (mean and std), as suggested by R5.
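The abstract reports a macro F1-score, while the feedback above mentions a weighted average F1 score; these are different aggregates. The plain-Python sketch below shows both, plus the mean and standard deviation across folds. The toy labels are hypothetical, and using the sample (n−1) standard deviation across folds is an assumption, since the page does not say which variant was reported.

```python
import statistics

def per_class_f1(y_true, y_pred, labels):
    """F1 = 2TP / (2TP + FP + FN) for each class; 0.0 if the class never occurs."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean over classes: every class counts equally, however rare."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    """Mean over classes weighted by support (number of true samples per class)."""
    scores = per_class_f1(y_true, y_pred, labels)
    support = {c: sum(1 for t in y_true if t == c) for c in labels}
    return sum(scores[c] * support[c] for c in labels) / sum(support.values())

def summarize_folds(fold_scores):
    """Mean and sample standard deviation of a metric across CV folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
print(macro_f1(y_true, y_pred, ["a", "b"]))     # ~0.733
print(weighted_f1(y_true, y_pred, ["a", "b"]))  # ~0.767
```

Because the weighted variant is dominated by frequent classes, it can hide poor performance on a rare class that the macro variant would expose; reporting per-fold values alongside `summarize_folds` would also address Reviewer #4's request.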

Furthermore, we want to address the comments raised by R1 and R5 about the clinical motivation of the proposed dataset, which is used as an exemplary application for the proposed augmentation method. The main goal of the proposed dataset is to lay the foundation for exploiting audio signals for surgical phase recognition (SPR). SPR is usually accomplished through the analysis of video, which is a large field of research, especially for endoscopic video in minimally invasive procedures. For open surgery, e.g. many orthopedic procedures, the OR is a hectic environment and camera views are often blocked by people and equipment. Therefore, we believe that acoustic signals could be a valuable alternative for phase detection, as they are easy to integrate, rich in information, and not affected by occlusion. As this topic is only rudimentarily explored in prior work, we propose an audio dataset captured from one of the most frequent surgeries in orthopedics, THA. We have rephrased parts of the introduction to make this clearer. While we agree that the dataset does not yet enable fully automated SPR in a real-world scenario (no speech or other background noise), we emphasize that this work serves as a scientific proof-of-concept and should not be considered a fully optimized and product-ready system. Recording this data requires access to the OR and the installation of specialized equipment, which is challenging and time-intensive, especially under Covid conditions. We consider the dataset size in combination with CV to be sufficient for a strong proof-of-concept study and added a statement about the limited dataset size (which is in accordance with the MICCAI reviewer guidelines for CAI papers) in the discussion section. For reproducibility, and because there is no comparable dataset for open surgery publicly available, we will make the dataset public upon acceptance.

We furthermore thank the reviewers for the constructive suggestions about the experimentation and added a paragraph in the manuscript to provide all requested details.

Post-rebuttal Meta-Reviews

Meta-review # 1 (Primary)

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The authors have addressed the main concerns raised by the reviewers. Although technical novelty remains limited, the dataset is the main contribution. The clinical motivation has been clarified. Such data is a valuable contribution to our scientific community and will allow further progress in using audio as an additional modality for activity/surgical workflow recognition.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

6

Meta-review #2

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The reviewers still have major concerns about the novelty and the performance evaluation of the proposed method. The main contribution of this work is just the dataset.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Reject
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

8

Meta-review #3

Please provide your assessment of the paper taking all information into account, including rebuttal. Highlight the key strengths and weaknesses of the paper, clarify how you reconciled contrasting review comments and scores, indicate if concerns were successfully addressed in the rebuttal, and provide a clear justification of your decision. If you disagree with some of the (meta)reviewer statements, you can indicate so in your meta-review. Please make sure that the authors, program chairs, and the public can understand the reason for your decision.

The paper presents a new clinical audio dataset, and a framework for data augmentation of clinical audio data based on GANs. The topic is of interest and the paper is well-written and well-presented. The concerns of the reviewers around the validation experiments and the method novelty remain, however given the contributions of the paper I recommend acceptance.
After you have reviewed the rebuttal, please provide your final rating based on all reviews and the authors’ rebuttal.

Accept
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

10


Conditional Generative Data Augmentation for Clinical Audio Datasets