
Authors

Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Sidney Fels, Jerry L. Prince, Georges El Fakhri, Jonghye Woo

Abstract

The tongue’s intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
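The key "plastic" mechanism described in the abstract is mapping weighting maps of variable width to a fixed-size representation via single-level spatial pyramid pooling. Below is a minimal, illustrative sketch of such pooling in PyTorch; the bin counts and feature dimensions are assumptions for illustration, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class SingleLevelSPP(nn.Module):
    """Pool a (B, C, H, W) feature map with variable W into a fixed (B, C, bins, bins) grid."""
    def __init__(self, bins=(4, 4)):
        super().__init__()
        # AdaptiveAvgPool2d produces the same output size regardless of input size.
        self.pool = nn.AdaptiveAvgPool2d(bins)

    def forward(self, x):
        return self.pool(x)

# Weighting-map features of different widths map to the same fixed-size output.
spp = SingleLevelSPP(bins=(4, 4))
for width in (120, 200, 331):
    h = torch.randn(1, 64, 20, width)  # hypothetical encoder features from a weighting map
    print(spp(h).shape)                # always torch.Size([1, 64, 4, 4])
```

Because the pooled output size is fixed, a standard fixed-size decoder can then generate spectrograms of constant dimensions without cropping, padding, or resizing the input.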

Link to paper

DOI: https://doi.org/10.1007/978-3-031-43990-2_41

SharedIt: https://rdcu.be/dnwLW

Link to the code repository

N/A

Link to the dataset(s)

N/A


Reviews

Review #2

  • Please describe the contribution of the paper

    The aim of this study was to investigate how tongue movements relate to speech acoustics by translating weighting maps - which represent the functional units of the tongue - into their corresponding audio waveforms.

    The authors propose a deep learning framework, named PLT, for translating weighting maps to corresponding audio waveforms using two-dimensional spectrograms as a proxy representation. PLT is designed to flexibly process variable-sized weighting maps and efficiently model the global correlation of wide matrix input while improving the realism of generated spectrograms with pair-wise utterance consistency and adversarial training.
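    For context, the pair-wise utterance consistency mentioned above is enforced with a Maximum Mean Discrepancy (MMD) constraint between latent features. The sketch below shows a standard Gaussian-kernel MMD loss in PyTorch; the kernel choice, bandwidth, and feature dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # x: (n, d), y: (m, d) -> (n, m) Gaussian kernel matrix
    dist_sq = torch.cdist(x, y) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(feat_a, feat_b, sigma=1.0):
    """Squared Maximum Mean Discrepancy between two sets of latent feature vectors."""
    k_aa = gaussian_kernel(feat_a, feat_a, sigma).mean()
    k_bb = gaussian_kernel(feat_b, feat_b, sigma).mean()
    k_ab = gaussian_kernel(feat_a, feat_b, sigma).mean()
    return k_aa + k_bb - 2.0 * k_ab

# Hypothetical usage: pull together latent features of the same utterance from two subjects.
z_subject1 = torch.randn(8, 256)
z_subject2 = torch.randn(8, 256)
consistency = mmd_loss(z_subject1, z_subject2)
```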

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Easy to read, explains the intuitions behind using the model components.
    • Adapted conventional Vision Transformers (ViT) to their task, e.g., using directional product relative position bias and single-level spatial pyramid pooling.
    • Added utterance consistency in the latent feature space and adversarial training for additional supervision, and used the weighted losses as the overall optimization objective.
    • Presented relevant ablation studies, for example on the values of the weighting parameters β (beta) and λ (lambda).
    • Evaluation results show a sizeable improvement over the baselines.
    • Evaluation and ablation experiments are detailed enough to explain the validity and rationality of the different parts of the proposed model, e.g., results for variants of the method with cross-embedding, without pair-wise disentanglement, and without the GAN loss. This helps establish whether those components are useful.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Evaluation was run on data collected from 29 subjects speaking only the words “a souk” or “a geese.” That does not give much confidence in the model's generalization ability: will the model perform well for other unseen subjects or words? The better performance of PLT compared to the baselines could simply be due to higher model capacity, i.e., the number of parameters in the ViT. It would be interesting to see how it would perform on unseen subjects in a zero-shot setting, for example on a dataset such as the one described at https://www.nature.com/articles/s41597-021-01041-3.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    No code is provided, so it is hard to comment on reproducibility. However, there are theoretical proofs for the algorithms mentioned in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    It would be interesting to see how the method would perform on unseen subjects in a zero-shot setting. Is it possible to obtain other benchmark data for a proper, standardized evaluation? Does the following paper describe a similar dataset? https://www.nature.com/articles/s41597-021-01041-3

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    7

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this is good work: a well-written paper that explains the validity and rationality of the different parts of the proposed model. Had there been more evaluation results, I would have rated it even higher.

  • Reviewer confidence

    Confident but not absolutely certain

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes an end-to-end framework that generates 2D spectrograms from tagged-MRI-derived weighting maps of two utterances. The architecture consists of a plastic light transformer (PLT) encoder, a fixed-size CNN decoder, and a GAN discriminator. The PLT extracts the local and global relations of the input (Hglobal and Hlocal) using a token system. In addition, the encoder uses single-level spatial pyramid pooling, which allows features of fixed size to be extracted, so that the decoder can infer a spectrogram of fixed size. This spectrogram is compared with the ground truth (generated using the Griffin-Lim algorithm) through a GAN discriminator. The training protocol minimizes the discrepancy between the latent features of the same word pronounced by two subjects, as well as between two different words pronounced by the same subject. This allows training with only two words from 29 participants (a small dataset). On top of this, data augmentation is applied to the H matrices and the audio waveforms.
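    The Griffin-Lim algorithm mentioned above is commonly used to build magnitude-spectrogram targets and to recover audible waveforms from generated spectrograms. A minimal librosa sketch follows; the file name and STFT parameters are assumptions, not the paper's settings.

```python
import numpy as np
import librosa

# Hypothetical recording; sr=None keeps the native sampling rate.
wav, sr = librosa.load("utterance.wav", sr=None)

# Magnitude spectrogram used as the proxy training target (illustrative STFT parameters).
mag = np.abs(librosa.stft(wav, n_fft=512, hop_length=128))

# Griffin-Lim iteratively estimates the phase to invert a magnitude spectrogram to audio.
recon = librosa.griffinlim(mag, n_iter=32, hop_length=128)
```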

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Its main strength is the extensive and detailed amount of information condensed into 8 pages. Even for a reader unfamiliar with the subject, the article provides all the references and enough information for an inexperienced person to learn about the topic and attempt to reproduce it, in addition to being useful for understanding the functioning of the tongue and its role in speech-related disorders. The authors propose data augmentation methods that go beyond adding noise to the H matrices: randomly cropping the columns to the nearest hundred and employing a sliding window over the audio waveforms. The extensive ablation studies show the importance of fixing the feature size after the features have been extracted rather than before (e.g., via cropping/padding or bicubic resizing). They also compare against CNN and Vision Transformer architectures. The combination of the GAN loss, pair-wise disentanglement, and cross-embedding boosts the correlation metric from 0.70 to 0.74. On top of that, they present a sensitivity analysis of the influence of the two loss weights (the GAN loss penalty and the disentanglement loss penalty), chosen as 1 and 0.75, respectively. Their work was pleasant to read and really interesting; I have many positive comments, and their work has a coherent scientific motivation.
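    A minimal sketch of the two augmentations described above follows: random column cropping of the NMF weighting map H and a sliding window over the audio waveform. The matrix size, crop width, and window/hop lengths are illustrative assumptions, not the paper's values.

```python
import numpy as np

def random_column_crop(H, target_cols):
    """Randomly crop the weighting map H (rows x cols) to a fixed number of columns."""
    start = np.random.randint(0, H.shape[1] - target_cols + 1)
    return H[:, start:start + target_cols]

def sliding_windows(wav, win_len, hop):
    """Yield overlapping waveform segments as extra training samples."""
    for start in range(0, len(wav) - win_len + 1, hop):
        yield wav[start:start + win_len]

H = np.random.rand(50, 331)                     # hypothetical NMF weighting map
H_aug = random_column_crop(H, target_cols=300)  # e.g., crop columns down to the nearest hundred
segments = list(sliding_windows(np.random.randn(16000), win_len=8000, hop=2000))
```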

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Its main disadvantage is that, being a complex architecture with a token mechanism, it seems difficult to implement. If the code is not released, a reimplementation for comparison as state of the art could take roughly six months to a year, not counting the acquisition of a dataset, which would have to use the same words to reproduce the results. Releasing pre-trained weights could at least ensure a correct implementation of the network and allow a first inference on new data.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The work seems hard to reproduce; the authors do not specify whether they will release the code or the images.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Even though the paper is far from my field, it contains all the references I needed to understand it. Thank you for this amazing work.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    8

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The problem has a strong clinical motivation. The authors try to find the correlation between the articulation of the tongue and speech using MRI and neural networks. They conduct extensive ablation studies and explain the choice of the different hyperparameters. It was very interesting, and the research questions seem to be well formulated and well addressed.

  • Reviewer confidence

    Very confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #1

  • Please describe the contribution of the paper

    In this paper, the authors propose a new approach for synthesizing speech audio from tagged MRI sequences. This is the first attempt to relate functional units to audio waveforms using intermediate representations. In the framework, a plastic light transformer is developed to achieve efficient global modeling.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This work studies an interesting problem that aims to synthesize sounds from tagged MRI, in which the authors make an attempt to relate functional units with audio waveforms using intermediate representations.

    • The proposed PLT model is technically sound.

    • Ablation studies can validate the effectiveness of adopted modules.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • I have some concerns about the reliability of the results because there is a lack of diversity in the data: the dataset only includes recordings of subjects saying the words “a souk” or “a geese.” I am curious whether the model actually learns the relationship between the input weighting maps and the output speech sounds, or whether it is simply fitting the limited data available.

    • The provided samples contain very strong noise and artifacts in both the generated and ground-truth sounds.

    • The weighting map is different from natural images. Can the authors explain why a vision model can be directly used to process the weighting map for sound synthesis?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

    The authors have shared helpful implementation details, but the reproduction of the work is impossible without the release of the dataset and code.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2023/en/REVIEWER-GUIDELINES.html

    Please consider comments in Weaknesses.

  • Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

    4

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My biggest concern is the reliability of the results. Experiments on a very small dataset with limited diversity make the results unconvincing.

  • Reviewer confidence

    Not confident

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Primary Meta-Review

  • Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

    The paper proposes a technically sound PLT model for synthesizing speech audio from tagged MRI sequences. PLT is designed to flexibly process variable-sized weighting maps and efficiently model the global correlation of wide matrix input, while improving the realism of the generated spectrograms with pair-wise utterance consistency and adversarial training. The evaluation shows a sizeable improvement over the baselines. The strengths include the clear exposition, relevant ablation studies, and detailed evaluation. However, the lack of diversity in the dataset and the noisy samples are weaknesses, and the model's generalization ability is another important concern. Overall, it is considered a strong paper with minor weaknesses.




Author Feedback

N/A


