Authors

Xiaofeng Liu, Fangxu Xing, Jerry L. Prince, Jiachen Zhuo, Maureen Stone, Georges El Fakhri, Jonghye Woo

Abstract

Understanding the underlying relationship between tongue and oropharyngeal muscle deformation seen in tagged-MRI and intelligible speech plays a vital role in advancing speech motor control theories and treatment of speech-related disorders. Because of their heterogeneous representations, however, direct mapping between the two modalities (two-dimensional plus time tagged-MRI sequence and one-dimensional waveform) is not straightforward. Instead, we resort to two-dimensional spectrograms as an intermediate means, covering both pitch and resonance, from which to develop an end-to-end deep learning framework to translate from a sequence of tagged-MRI to its corresponding audio waveform with limited dataset size. Our framework hinges on a novel fully convolutional asymmetry translator with guidance of a self residual attention scheme to specifically exploit the moving muscular structures during speech. In addition, a pairwise correlation of the samples with the same utterances is utilized with a latent space representation disentanglement scheme. Furthermore, an adversarial training approach with generative adversarial networks is incorporated to provide enhanced realism on our generated spectrograms. Our experimental results, carried out with a total of 63 tagged-MRI sequences alongside speech acoustics, show that our framework enabled the generation of clear audio waveforms from a sequence of tagged-MRI unseen in training, surpassing competing methods. Thus, our framework provided the potential to aid in better understanding the relationship between the two modalities.
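
As a rough illustration of the intermediate representation mentioned in the abstract, the sketch below turns a one-dimensional waveform into a two-dimensional magnitude spectrogram with a short-time Fourier transform. It is not the authors' pipeline: the function name `magnitude_spectrogram`, the Hann window, and the frame and hop sizes are assumptions chosen for illustration, and no mel filterbank is applied.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_len, hop):
    """Return a list of frames, each a list of DFT magnitudes (bins 0..frame_len//2).

    Hypothetical helper for illustration only: a naive O(N^2) DFT per frame,
    with a periodic Hann window applied before the transform.
    """
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectrogram = []
    # Slide a window over the waveform; each step yields one time column.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        # Only the non-negative frequencies are kept for a real-valued input.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        spectrogram.append(magnitudes)
    return spectrogram
```

For a pure tone that completes four cycles per 32-sample frame, the largest magnitude in every frame falls in frequency bin 4, so the time axis and the frequency axis of the result can be read directly.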

Link to paper

DOI: https://link.springer.com/chapter/10.1007/978-3-031-16446-0_36

SharedIt: https://rdcu.be/cVRTv

Link to the code repository

N/A

Link to the dataset(s)

N/A

Reviews

Review #1

Please describe the contribution of the paper

This paper proposed a novel deep-learning-based method to synthesize spectrograms (audio) from tagged-MRI sequences (imaging).
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The authors raised an interesting research topic: translating imaging data (tagged MRI) to audio data (spectrograms). A self-residual-attention-guided heterogeneous translator and utterance disentanglement were designed for this specific task. The proposed method outperformed the available method for a similar task.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Details or justifications were missing for the operations used in pairwise disentanglement training.

The sample size for the experiments was small. Only two words were investigated in this study, and the data samples were imbalanced.

The performance gain from adding the GAN is marginal compared with the gains from attention and pairwise disentanglement. Taking the computational cost of the GAN into account, its inclusion in such a model is questionable. A comparison of the loss with and without the GAN was missing.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors have provided a fair amount of information to reproduce the method.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

For pairwise disentanglement training, the authors selected a few channels of the feature to denote the mean and variance. Please explain how these channels were selected and why.

I suggest the authors also justify the necessity of the GAN.

I believe it may be hard to enlarge the dataset and explore more word samples, or even sentences. But it would be more interesting if more samples could be included in this study.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

5
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper raised an interesting research topic, and the authors provided a fair solution to this problem. Some details were missing, but overall this is a fair paper that can be accepted.
Number of papers in your stack

6
What is the ranking of this paper in your review stack?

1
Reviewer confidence

Somewhat Confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper
1. To our knowledge, this is the first attempt at translating tagged-MRI sequences to audio waveforms.
2. They proposed a novel self residual attention guided heterogeneous translator to achieve efficient tagged-MRI-to-spectrogram synthesis.
3. The utterance and subject factors disentanglement and adversarial training are further explored to improve synthesis performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

They are the first team to attempt translating tagged-MRI sequences to audio waveforms. They proposed an efficient fully convolutional asymmetry translator, aided by a self residual attention scheme, that specifically focuses on the moving muscular structures involved in speech production. They also used a pairwise correlation of samples with the same utterances, together with a latent space representation disentanglement scheme. Furthermore, they incorporated an adversarial training approach with a GAN to yield improved results on the generated spectrograms. The topic and results are very interesting.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

It is hard to reproduce because the authors did not share the source code.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

Reproducibility is hampered because the authors did not share the source code.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The work is difficult to reproduce, but the results are promising for applications.
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper proposed a novel method, and this is the first attempt at translating tagged-MRI sequences to audio waveforms. The results are promising for applications.
Number of papers in your stack

6
What is the ranking of this paper in your review stack?

6
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The authors propose an encoder-decoder (translator) model, trained in GAN fashion on pairs of input data, to synthesize acoustic mel spectrograms from MRI sequences of oropharyngeal muscle movement (which corresponds to tongue movement). To exploit only the information that lies in the muscle movement, an additional encoder-decoder FCNN (residual attention) is trained alongside to filter out static regions in the input frames. The latent space of the encoder-decoder model is disentangled into utterance-specific and subject-specific latent features, where the utterance-specific part is learned by enforcing prior knowledge via KL divergence on utterance-matched sample pairs. All modifications seem to improve the performance of the model.
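
To make the KL-divergence idea in this summary concrete, here is a minimal sketch of how a pairwise penalty on utterance-matched samples could be computed. Everything specific in it is an assumption rather than the paper's stated method: the latent vector is taken to hold `k` mean channels followed by `k` log-variance channels, the distributions are treated as diagonal Gaussians, and the names `split_mean_logvar` and `gaussian_kl` are hypothetical.

```python
import math


def split_mean_logvar(features, k):
    """Split a flat latent vector into (means, log-variances).

    Assumed layout for illustration: the first k channels are means and
    the next k channels are log-variances; remaining channels are ignored.
    """
    if len(features) < 2 * k:
        raise ValueError("need at least 2*k channels")
    return features[:k], features[k:2 * k]


def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians.

    Uses the closed form, summed over channels:
    0.5 * (log(var2/var1) + (var1 + (mu1 - mu2)^2) / var2 - 1).
    """
    total = 0.0
    for m1, lv1, m2, lv2 in zip(mu1, logvar1, mu2, logvar2):
        total += 0.5 * (lv2 - lv1
                        + (math.exp(lv1) + (m1 - m2) ** 2) / math.exp(lv2)
                        - 1.0)
    return total
```

Under these assumptions, two samples of the same utterance would be encoded, split with `split_mean_logvar`, and penalized with `gaussian_kl`; the penalty is zero when both utterance-specific distributions coincide and grows as their means or variances drift apart.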
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The proposed architecture is sophisticated and tailored to the target application and data format, and the authors present an ablation study which shows the benefits of each part of the proposed model. The approach is novel and interesting, the results seem consistent, and the paper is well written.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

The proposed model is trained with a total of 63 tagged-MRI sequences, which is a very limited sample size. It would be interesting to have a discussion of how this method can scale up.

The differences from the competing architecture, Lip2AudSpec, could be explained better. It could also be explained better how Lip2AudSpec was adapted to MRI data.

The authors did not mention the research field of speech generation from real-time ultrasound, which is closely related.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Note, that authors have filled out a reproducibility checklist upon submission. Please be aware that authors are not required to meet all criteria on the checklist - for instance, providing code and data is a plus, but not a requirement for acceptance

The authors provide sufficient information to reimplement the proposed architecture.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review: https://conferences.miccai.org/2022/en/REVIEWER-GUIDELINES.html

The authors should include a discussion of how the model would scale up with an increased dataset size.

The reconstructed samples are very distorted which is probably (partly) caused by the use of the Griffin-Lim algorithm for waveform reconstruction. Maybe the authors can discuss more recent approaches, e.g. MelGAN [1].

The authors should describe the implementation of the baseline method, Lip2AudSpec, in more detail.

Basic information about training, e.g. batch size, should be included in the manuscript.

The “self-trained attention network” seems to only generate masks by blurring the residual frames. Why was this complicated approach chosen, and could a simpler technique perhaps even improve the performance (e.g. filling the area of white pixels in the binary residual frame with simple computer vision techniques)?
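
The simpler alternative raised in this comment can be sketched in a few lines. This is only an illustration of the reviewer's suggestion, not anything taken from the paper: the threshold value, the square dilation neighbourhood, and the names `residual_mask` and `dilate` are all assumptions, and frames are represented as plain nested lists of pixel intensities.

```python
def residual_mask(prev_frame, next_frame, threshold):
    """Binary mask of pixels whose intensity changed by more than threshold."""
    return [[1 if abs(b - a) > threshold else 0
             for a, b in zip(row_prev, row_next)]
            for row_prev, row_next in zip(prev_frame, next_frame)]


def dilate(mask, radius=1):
    """Grow every foreground pixel into a (2*radius+1)-wide square.

    A basic morphological dilation; it fills small gaps in the binary
    residual frame so that the moving region becomes one connected area.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                # Mark the neighbourhood, clipped to the image bounds.
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out
```

A mask produced this way is fixed by hand-chosen parameters, whereas a trained attention network can adapt its masks to the data; whether that flexibility matters on a dataset of this size is the question the comment puts to the authors.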

The authors should discuss the closely related field of generating speech from real-time ultrasound images, e.g. [2].

[1] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, Aaron C. Courville, MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis, Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019

[2] Jing-Xuan Zhang, Korin Richmond, Zhen-Hua Ling, Li-Rong Dai, TaLNet: Voice reconstruction from tongue and lip articulation with transfer learning from text-to-speech synthesis, Proceedings of the AAAI Conference on Artificial Intelligence, 2021
Rate the paper on a scale of 1-8, 8 being the strongest (8-5: accept; 4-1: reject). Spreading the score helps create a distribution for decision-making

6
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a novel and creative approach which could be used to understand the relationship between tongue muscle movement and speech.
Number of papers in your stack

5
What is the ranking of this paper in your review stack?

2
Reviewer confidence

Very confident
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Primary Meta-Review

Please provide your assessment of this work, taking into account all reviews. Summarize the key strengths and weaknesses of the paper and justify your recommendation. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. In case of an invitation for rebuttal, clarify which points are important to address in the rebuttal.

This is an interesting work and application, with a well-presented method and promising results. The reproducibility of the work can be improved.
What is the ranking of this paper in your stack? Use a number between 1 (best paper in your stack) and n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

2

Author Feedback

N/A

Tagged-MRI Sequence to Audio Synthesis via Self Residual Attention Guided Heterogeneous Translator